Simple, Portable Parallel C++ with Hemi 2 and CUDA 7.5

Originally published at:

The last two releases of CUDA have added support for the powerful new features of C++. In the post The Power of C++11 in CUDA 7 I discussed the importance of C++11 for parallel programming on GPUs, and in the post New Features in CUDA 7.5 I introduced a new experimental feature in the NVCC CUDA C++ compiler:…

Hi Mark,

Nice article!

Can you compare Hemi and OpenACC? Suppose I am developing an algorithm where some parts suit my current CPU and other parts suit my current GPU. Which tool should I choose so that I can easily switch a function or algorithm between CPU and GPU to get the best performance?


Thanks! OpenACC is a compiler-based approach: you annotate loops and data with directives, and the compiler generates the parallel code. Hemi is a wrapper API that makes it easy to write custom parallel functions that can be compiled for either the host or the device. Hemi is meant to complement CUDA C++, not replace it. So OpenACC is potentially higher-level and easier to use with existing code, while Hemi, like CUDA C++, is lower-level and therefore may afford a greater level of control.

If the loops you want to accelerate with GPUs are mostly C, you may have very good luck with OpenACC (especially combined with Unified Memory), because you may be able to let the compiler do more for you.
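To make the directive style concrete, here is a minimal sketch of a SAXPY loop annotated with OpenACC. The function name `saxpy_acc` and the use of `std::vector` are just choices for this illustration; explicit data clauses are shown instead of relying on Unified Memory. A compiler without OpenACC support ignores the pragma and runs the loop serially on the CPU, so the same source builds either way.

```cpp
#include <cassert>
#include <vector>

// Computes y = a * x + y element by element.
// saxpy_acc is an illustrative name, not something taken from the article.
void saxpy_acc(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();

    // Standard OpenACC directive: asks the compiler to parallelize the loop,
    // copy x to the device, and copy y to the device and back.
    // Without an OpenACC compiler this line is ignored and the loop runs serially.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The loop body is untouched; switching between CPU and GPU is a matter of compiler and flags rather than source changes, which is why this style tends to fit existing C-like loops well.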

If you have complex C++ code, you may have better luck with a lower-level approach like Hemi, or with Hemi's parallel_for feature.
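The lambda-based parallel-for style looks roughly like the sketch below. To keep it self-contained, it uses a hypothetical host-only stand-in, `sketch_parallel_for`, which is not part of Hemi and simply runs the body serially. With Hemi itself, as I understand the Hemi 2 API, the same body would be passed to `hemi::parallel_for` as a `HEMI_LAMBDA` lambda and built with nvcc so that it can run on the device.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for illustration only, NOT Hemi's implementation:
// it calls body(i) serially on the host for every i in [first, last).
template <typename Index, typename Body>
void sketch_parallel_for(Index first, Index last, Body body)
{
    for (Index i = first; i < last; ++i) {
        body(i);
    }
}

// Computes y = a * x + y by handing the loop body to the parallel-for helper.
void saxpy_lambda(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();

    // Capture raw pointers and scalars by value, which is also what a
    // device lambda would require.
    sketch_parallel_for(0, n, [=](int i) {
        yp[i] = a * xp[i] + yp[i];
    });
}
```

The point of the pattern is that the loop body is ordinary C++ written once; where it executes is decided by the helper and the compiler, not by the calling code.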

The article has a formatting error: "CUDA’s lt;lt;lt; gt;gt;gt; triple-angle-bracket" (the escaped entities should render as <<< >>>).