PLDI 2024
Mon 24 - Fri 28 June 2024 Copenhagen, Denmark
Wed 26 Jun 2024 16:20 - 16:40 at Sweden - Fast Linear Algebra Chair(s): Zachary Tatlock

Data-parallel computations, such as linear algebra routines (BLAS) and stencil computations, constitute one of the most important classes in parallel computing, e.g., due to their relevance for deep learning. Efficiently de-composing such computations across the memory and core hierarchies of modern architectures and re-composing the computed intermediate results into the final result – we say (de/re)-composition for short – is key to achieving high performance for these computations on, e.g., GPUs and CPUs. Current high-level approaches to generating data-parallel code are often restricted to a particular subclass of data-parallel computations and architectures (e.g., only linear algebra routines on only GPUs, or only stencil computations), and/or they rely on a user-guided optimization process to find a well-performing (de/re)-composition, which is complex and error-prone for the user.
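
To make the (de/re)-composition idea concrete, the following is a minimal sketch, assuming a dot product as the data-parallel computation: the input is de-composed into chunks, partial results are computed independently (here via a thread pool), and the partials are re-composed with the combine operator (+). The names decompose, recompose, and NUM_PARTS are illustrative only and not part of the MDH formalism's API.

```python
# Minimal (de/re)-composition sketch for a dot product (illustrative only).
from concurrent.futures import ThreadPoolExecutor

NUM_PARTS = 4  # hypothetical tuning parameter: number of parallel parts

def decompose(xs, ys, parts):
    """De-compose the inputs into `parts` contiguous chunks."""
    step = (len(xs) + parts - 1) // parts
    return [(xs[i:i + step], ys[i:i + step]) for i in range(0, len(xs), step)]

def partial_dot(chunk):
    """Compute one intermediate (partial) result."""
    xs, ys = chunk
    return sum(x * y for x, y in zip(xs, ys))

def recompose(partials):
    """Re-compose intermediate results with the combine operator (+)."""
    return sum(partials)

xs, ys = list(range(1000)), list(range(1000))
with ThreadPoolExecutor(max_workers=NUM_PARTS) as pool:
    partials = pool.map(partial_dot, decompose(xs, ys, NUM_PARTS))
# Same result as the sequential sum(x * y for x, y in zip(xs, ys)).
print(recompose(partials))
```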

We formally introduce a systematic (de/re)-composition approach based on the algebraic formalism of Multi-Dimensional Homomorphisms (MDHs) (https://mdh-lang.org). Our approach is designed to be general enough to apply to a wide range of data-parallel computations and to various kinds of target parallel architectures. To efficiently target the deep and complex memory and core hierarchies of contemporary architectures, we exploit our (de/re)-composition approach for a correct-by-construction, parametrized cache blocking and parallelization strategy. We show that our approach is powerful enough to express, in the same formalism, the (de/re)-composition strategies of different classes of state-of-the-art approaches (scheduling-based, polyhedral, etc.), and we demonstrate that the parameters of our strategies enable systematically generating code that can be fully automatically optimized (auto-tuned) for the particular target architecture and the characteristics of the input and output data (e.g., their sizes and memory layouts). In particular, our experiments confirm that, via auto-tuning, we achieve higher performance than state-of-the-art approaches, including hand-optimized solutions provided by vendors (such as NVIDIA cuBLAS/cuDNN and Intel oneMKL/oneDNN), on real-world data sets and for a variety of data-parallel computations, including: linear algebra routines, stencil and quantum chemistry computations, data mining algorithms, and computations that have recently gained attention due to their relevance for deep learning.
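
As an illustration of a parametrized cache blocking strategy, the following is a minimal sketch of tiled matrix multiplication in which the tile sizes TI, TJ, and TK are exactly the kind of parameters an auto-tuner would search over. The code is a hypothetical illustration, not the code generated by the MDH approach.

```python
# Minimal sketch of parametrized cache blocking (tiling) for matmul.
# TI, TJ, TK are hypothetical tuning parameters, not the paper's API.
import numpy as np

def matmul_tiled(A, B, TI=32, TJ=32, TK=32):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, TI):          # de-compose the iteration space
        for j0 in range(0, m, TJ):      # into TI x TJ x TK tiles ...
            for k0 in range(0, k, TK):
                # ... and re-compose partial tile products by addition.
                C[i0:i0 + TI, j0:j0 + TJ] += (
                    A[i0:i0 + TI, k0:k0 + TK] @ B[k0:k0 + TK, j0:j0 + TJ])
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(matmul_tiled(A, B), A @ B)
```

Choosing TI, TJ, and TK so that the three tiles fit into a given cache level is precisely the kind of architecture- and input-dependent decision that the approach delegates to auto-tuning.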

Wed 26 Jun

Displayed time zone: Windhoek

16:00 - 17:20
Fast Linear Algebra (PLDI Research Papers) at Sweden
Chair(s): Zachary Tatlock (University of Washington)
16:00 (20m) Talk
A Verified Compiler for a Functional Tensor Language
PLDI Research Papers
Amanda Liu (Massachusetts Institute of Technology), Gilbert Bernstein (University of Washington, Seattle), Adam Chlipala (Massachusetts Institute of Technology), Jonathan Ragan-Kelley (Massachusetts Institute of Technology)
DOI
16:20 (20m) Talk
[TOPLAS] (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms
PLDI Research Papers
Ari Rasch (University of Münster)
Link to publication · DOI · Pre-print · Media Attached
16:40 (20m) Talk
Compilation of Modular and General Sparse Workspaces
PLDI Research Papers
Genghan Zhang (Stanford University), Olivia Hsu (Stanford University), Fredrik Kjolstad (Stanford University)
DOI
17:00 (20m) Talk
Descend: A Safe GPU Systems Programming Language
PLDI Research Papers
Bastian Köpcke (University of Münster), Sergei Gorlatch (University of Münster), Michel Steuwer (Technische Universität Berlin)
DOI · Pre-print