Performance, Portability, and Productivity for Data-Parallel Applications on Multi- and Many-Core Architectures
We present a novel approach to performance, portability, and productivity of data-parallel computations on CPUs and GPUs. Our approach is based on Multi-Dimensional Homomorphisms (MDHs), a formally defined class of functions that covers important data-parallel computations, e.g., linear algebra routines (BLAS), stencil computations, and tensor contractions. For MDHs, we present a high-level Domain-Specific Language (DSL) that contributes to high user productivity, and we propose a corresponding DSL compiler: it automatically generates optimized (auto-tuned) OpenCL code, thereby providing high, portable performance, across different architectures and input sizes, for programs in our DSL. Our experimental results, on an Intel CPU and an NVIDIA GPU, demonstrate competitive and often significantly better performance of our approach as compared to well-performing state-of-practice approaches, e.g., Intel MKL/MKL-DNN, NVIDIA cuBLAS/cuDNN, and Facebook's Tensor Comprehensions framework.
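To give an intuition for the MDH concept referenced above, the following is a minimal, hypothetical Python sketch (not the paper's actual DSL or compiler): a computation is specified by a scalar function applied at each point of a multi-dimensional index space, together with one combine operator per dimension. The names `md_hom`, `go`, and the argument layout are illustrative assumptions for this sketch only.

```python
from functools import reduce
import operator

def md_hom(f, combine_ops, index_space, data):
    """Illustrative sketch of a multi-dimensional homomorphism:
    apply scalar function `f` at every index point of `index_space`,
    then fold each dimension with its combine operator
    (outermost dimension listed first)."""
    def go(dims, idx):
        if not dims:
            return f(data, idx)          # leaf: scalar computation at one index point
        (size, op), rest = dims[0], dims[1:]
        # combine the partial results along this dimension
        return reduce(op, (go(rest, idx + (i,)) for i in range(size)))
    return go(list(zip(index_space, combine_ops)), ())

# Example: dot product as a 1-dimensional MDH.
# The scalar function multiplies point-wise; the single
# combine operator is addition.
x = [1, 2, 3]
y = [4, 5, 6]
dot = md_hom(lambda d, idx: d[0][idx[0]] * d[1][idx[0]],
             (operator.add,), (3,), (x, y))
# dot == 1*4 + 2*5 + 3*6 == 32
```

In the same spirit, matrix multiplication can be viewed as a 3-dimensional MDH: point-wise multiplication combined with concatenation in the two output dimensions and addition in the reduction dimension.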