Optimization of Computational Kernels Granularity

OpenCL kernels can be executed on a broad range of hardware devices that provide various levels of concurrency, use different caching mechanisms, and introduce different kernel execution overheads. Thus, finding the optimal granularity of computational kernels is not an easy task. Larger kernels may introduce lower overhead, higher parallelism, and better memory locality, but they may also consume more resources, such as registers or cache, and thus be inefficient.

In this research, we focus on kernel fusion. When a computation is realized by multiple kernels, fusing them may improve data locality, parallelism, or serial efficiency. However, it is highly impractical to develop libraries of fused kernels, as the number of potential kernel combinations is very high and fusion may decrease performance in some cases. Instead, the programmer may define the computation as a data flow between simple kernels, and a source-to-source compiler then creates fusions automatically according to the data dependencies between kernels and the targeted hardware device. We have developed a source-to-source compiler that fuses kernels performing (potentially nested) map and reduce operations. Currently, we are extending kernel fusion to more generic kernels.
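
To illustrate the idea, the following minimal sketch contrasts an unfused map followed by a reduce with a hand-fused equivalent. It is an assumption-laden illustration, not output of the compiler described above: plain C loops stand in for OpenCL kernels, and all function names (map_axpy, reduce_sum, unfused, fused) are hypothetical. The unfused version stores the intermediate vector to memory and reads it back, while the fused version keeps each intermediate value in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a map kernel: t[i] = a * x[i] + y[i]. */
static void map_axpy(float a, const float *x, const float *y,
                     float *t, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        t[i] = a * x[i] + y[i];
}

/* Stand-in for a reduce kernel: returns the sum of t[0..n-1]. */
static float reduce_sum(const float *t, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += t[i];
    return acc;
}

/* Unfused: two passes over memory; the intermediate vector is
 * written by the map and read again by the reduce. */
static float unfused(float a, const float *x, const float *y,
                     float *tmp, size_t n)
{
    assert(tmp != NULL); /* caller must supply scratch space of size n */
    map_axpy(a, x, y, tmp, n);
    return reduce_sum(tmp, n);
}

/* Fused: one pass; each mapped value is consumed immediately,
 * so no intermediate vector is stored. */
static float fused(float a, const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a * x[i] + y[i];
    return acc;
}
```

Both functions compute the same sum; the fused variant needs no temporary buffer and touches x and y only once. On a real device the trade-off is less clear-cut, since a fused kernel may need more registers per work-item, which is one reason the fusion decision depends on the targeted hardware.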

Results:

Jiří Filipovič, Siegfried Benkner. OpenCL Kernel Fusion for GPU, Xeon Phi and CPU. In Proceedings of IEEE International Symposium on Computer Architecture and High Performance Computing. 2015.
Jiří Filipovič, Matúš Madzin, Jan Fousek, Luděk Matyska. Optimizing CUDA code by kernel fusion: application on BLAS. The Journal of Supercomputing, vol. 71, issue 10, 2015.
Jiří Filipovič, Jan Fousek, Bedřich Lakomý, Matúš Madzin. Automatically Optimized GPU Acceleration of Element Subroutines in Finite Element Method. In Symposium on Application Accelerators in High Performance Computing, 2012.
Jan Fousek, Jiří Filipovič, Matúš Madzin. Automatic Fusions of CUDA-GPU Kernels for Parallel Map. Second International workshop on highly-efficient accelerators and reconfigurable technologies (HEART), 2011.

This work was supported by the project OP RD&E CERIT Scientific Cloud, CZ.02.1.01/0.0/0.0/16_013/0001802.