C++11 in CUDA: Variadic Templates

Originally published at: https://developer.nvidia.com/blog/cplusplus-11-in-cuda-variadic-templates/

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but also in device code. In my post “The Power of C++11 in CUDA 7” I covered some of the major new features of C++11, such…

How about OpenCL 2.1 support?

Useful examples, Mark! I had written a C preprocessor variadic macro for something similar to your first cudaLaunch example, and it works fine. Variadic kernels look very useful. I believe they could help simplify heterogeneous programming and portability. I'll be rereading your post a few more times, since it seems to be the clearest exposition of the topic. One question: is the CUDA toolkit updated with what you tell us? Are there more examples? After a quick search, I only found section E.2.9.8, "__global__ functions and function templates", and PTX ISA material.

"is the CUDA toolkit updated with what you tell us"? Not sure what you mean. As the post says, this is supported in CUDA 7, and the documentation link in the post points to the restrictions on variadic __global__ function template parameters. So I think the answer is "yes".

I would like to also mention perfect forwarding of arguments, as no discussion of function templates in C++11 is complete without it: http://eli.thegreenplace.ne...
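For readers who have not met perfect forwarding before, here is a minimal host-side sketch in standard C++11. The names `category`, `forwarded`, `not_forwarded`, and `call` are hypothetical and chosen only for illustration; nothing here comes from the post or from the CUDA toolkit, and the sketch makes no claim about device code.

```cpp
#include <string>
#include <utility>

// Hypothetical overloads used only to observe which value category arrives.
inline std::string category(const std::string&) { return "lvalue"; }
inline std::string category(std::string&&)      { return "rvalue"; }

// T&& on a deduced template parameter is a forwarding reference.
// std::forward<T> restores the value category the caller passed in.
template <typename T>
std::string forwarded(T&& arg) {
    return category(std::forward<T>(arg));
}

// Without std::forward, the named parameter `arg` is always an lvalue,
// so the rvalue overload is never selected.
template <typename T>
std::string not_forwarded(T&& arg) {
    return category(arg);
}

// Variadic form: forward every argument in the pack to the callable `f`.
template <typename F, typename... Args>
auto call(F&& f, Args&&... args)
    -> decltype(std::forward<F>(f)(std::forward<Args>(args)...)) {
    return std::forward<F>(f)(std::forward<Args>(args)...);
}
```

Passing a temporary, as in `forwarded(std::string("abc"))`, selects the rvalue overload, while `not_forwarded` with the same argument falls back to the lvalue overload. The variadic `call` applies the same idea to a whole parameter pack, which is why forwarding usually comes up alongside variadic templates.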

How does the compiler resolve lambda functions in CUDA 7.0? Are they inlined, or does each one become a real function call?
I need this information to estimate performance. If every lambda is compiled as a real function call, performance will suffer badly.

Hi Peter,
The compiler uses the same strategy to inline lambdas that it uses for any other function call. Where the lambda is called directly, the inliner will try to inline it as usual. But if the lambda is stored in an instance of nvstd::function (from the "nvfunctional" header) and the compiler is unable to figure out the underlying function at the call site, the call will not be inlined.
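The two situations described above can be sketched in host-side C++11. This sketch uses `std::function` as a stand-in for `nvstd::function`, which is an assumption made so that it compiles without the CUDA toolkit. The helper names (`apply_direct`, `apply_erased`, `demo_direct`, `demo_erased`) are hypothetical. Whether a given call is actually inlined depends on the compiler and optimization settings, so treat this as an illustration of the distinction rather than a guarantee.

```cpp
#include <functional>

// The callable's concrete type is a template parameter, so the compiler
// sees exactly which lambda body is invoked at the call site.
template <typename F>
int apply_direct(F f, int x) {
    return f(x);
}

// The callable is hidden behind a type-erased wrapper. Unless the compiler
// can see through the wrapper, this is an indirect call.
inline int apply_erased(const std::function<int(int)>& f, int x) {
    return f(x);
}

inline int demo_direct(int x) {
    auto twice = [](int v) { return 2 * v; };
    return apply_direct(twice, x);   // lambda type known: inlining candidate
}

inline int demo_erased(int x) {
    std::function<int(int)> twice = [](int v) { return 2 * v; };
    return apply_erased(twice, x);   // target hidden behind the wrapper
}
```

Both functions compute the same result; they differ only in how much the compiler knows about the call target. If performance matters, passing the lambda as a template parameter keeps its type visible, and inspecting the generated code is the reliable way to confirm what was inlined.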