Methods for automatic speedup of OpenCL and CUDA code

Research Group Scientific Computing, University of Vienna
OpenCL is an open standard for programming many types of accelerators as well as conventional CPUs. CUDA is a technology developed by NVIDIA that allows code to run on NVIDIA GPUs, x86 CPUs and, prospectively, AMD GPUs. Both OpenCL and CUDA make it possible to implement highly parallel computational kernels that are executed on computing devices (e.g. accelerators). Developing and optimizing kernels is challenging even for experienced programmers, because the code must be efficiently parallelized into thousands of independent threads and must respect many performance characteristics of current hardware, which differ significantly between hardware types and even between generations. Any automatic tool that eases code optimization is therefore of great value.
We focus on two main areas of automatic frameworks that help programmers write efficient code. First, we develop methods for autotuning: the programmer identifies parameters of the code that may influence performance, and the autotuning tool automatically searches the parameter space and picks the values that lead to the highest performance on a particular hardware device. Autotuning reduces the time needed for manual exploration of tuning parameters and allows developers to write flexible code that optimizes itself for the underlying hardware architecture automatically. Second, we develop kernel fusion methods. When a computation is realized by multiple kernels, fusing them may improve data locality, serial efficiency or the degree of parallelism. However, it is highly impractical to develop libraries of already fused kernels, as the number of potential kernel combinations is very high. Moreover, some fusions may be efficient on one type of device and inefficient on another. With our automatic fusion method, the programmer defines the computation as a data flow between kernels, and our source-to-source compiler creates fusions according to the data dependencies between kernels and the targeted hardware device.
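The two ideas can be illustrated with a minimal sketch in plain Python. It is not the group's tooling: the list comprehensions merely stand in for GPU kernels, the names `scale`, `add`, `axpy`, `timed` and `autotune` are hypothetical, and exhaustive search is assumed here only because it is the simplest strategy to show. The computation a*x + y is written once as two separate kernels and once as a single fused kernel, and a small autotuner treats the choice between them as a tuning parameter.

```python
import itertools
import time


def scale(a, x):
    """Kernel 1: multiply every element by a scalar."""
    return [a * v for v in x]


def add(x, y):
    """Kernel 2: element-wise sum of two vectors."""
    return [u + v for u, v in zip(x, y)]


def axpy(a, x, y, fused):
    """Compute a*x + y either as two kernels or as one fused kernel."""
    if fused:
        # Fused: each element is read once; no intermediate vector is stored.
        return [a * u + v for u, v in zip(x, y)]
    # Unfused: the result of scale() is materialized, then read again by add().
    return add(scale(a, x), y)


def timed(run, repeats=3):
    """Wrap a callable as a cost function: best wall-clock time of several runs."""
    def cost(**config):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(**config)
            best = min(best, time.perf_counter() - start)
        return best
    return cost


def autotune(space, cost):
    """Exhaustive search: evaluate every configuration, return the cheapest."""
    names = list(space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        c = cost(**config)
        if c < best_cost:
            best_config, best_cost = config, c
    return best_config, best_cost


if __name__ == "__main__":
    n = 100_000
    x, y = [1.0] * n, [2.0] * n
    # The tuning space here has a single (assumed) parameter: fuse or not.
    space = {"fused": [False, True]}
    config, seconds = autotune(space, timed(lambda fused: axpy(3.0, x, y, fused)))
    print(config, seconds)
```

Which configuration wins depends on the machine the script runs on, which is the point of tuning per device. Because `autotune` accepts any cost function, a measured runtime can be replaced by a performance model, and further parameters (for example a hypothetical work-group size) can be added to `space` without changing the search code.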

Results:

Jiří Filipovič, Siegfried Benkner. OpenCL Kernel Fusion for GPU, Xeon Phi and CPU. In Proceedings of IEEE International Symposium on Computer Architecture and High Performance Computing. 2015.
Jiří Filipovič, Matúš Madzin, Jan Fousek, Luděk Matyska. Optimizing CUDA code by kernel fusion: application on BLAS. The Journal of Supercomputing, vol. 71, issue 10, 2015.
Jiří Filipovič, Jan Fousek, Bedřich Lakomý, Matúš Madzin. Automatically Optimized GPU Acceleration of Element Subroutines in Finite Element Method. In Symposium on Application Accelerators in High Performance Computing, 2012.
Jan Fousek, Jiří Filipovič, Matúš Madzin. Automatic Fusions of CUDA-GPU Kernels for Parallel Map. In Second International Workshop on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART), 2011.

This work is supported by the project OP RD&E CERIT Scientific Cloud CZ.02.1.01/0.0/0.0/16_013/0001802