Templates

Tigga · January 18, 2009, 9:20pm

This falls slightly into the realm of a C++ question, but with CUDA bits!

I have a kernel that I want to make faster. It takes three input arguments, each of which varies between 0 and 32 depending on earlier calculations, and it updates an array. If I make one of the input arguments a template parameter I get significant speedups, and doing the same for the others might give similar speedups. However, as far as I can tell the C++ language isn’t really flexible enough to let me do this easily: each template argument has to be stated explicitly, i.e.:

if (a == 1 && b == 1) kernel<1, 1>
else if (a == 1 && b == 2) kernel<1, 2>
etc...

If I were to do this for each combination of two variables I would end up with 1024 lines of boring. Three variables would be insane. Now I imagine that three variables might actually cause a slowdown… and would take years to compile… but the sort of speedups which I may get could save days/weeks in the future.

So my question is this: is there any easy way of doing this? Some sort of loop would be lovely, but the compiler doesn’t appear to be clever enough to see that the values in the loop are constant and known at compile time. I contemplated some sort of macro but couldn’t get my head around how I’d do it.

Any ideas?

seibert · January 18, 2009, 9:44pm

I use precisely this trick with a templated kernel that has 3 different parameters. It is a big speedup because it allows loops to be unrolled in the kernel body based on the template parameters, as well as eliminating some dead if-statements where applicable.
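A minimal host-side sketch of why this helps, assuming a made-up function `weighted_sum` in place of a real kernel body (on the GPU it would carry a `__global__` or `__device__` qualifier). With the bound as a template parameter the compiler knows the trip count and can unroll the loop and drop the dead branch; with a runtime bound it cannot assume either.

```cpp
// Hypothetical stand-in for a kernel body; names are illustrative only.
// N is a compile-time constant, so the loop bound and the N == 0 test
// are both known when the function is compiled.
template <int N>
float weighted_sum(const float* data)
{
    float acc = 0.0f;
    for (int i = 0; i < N; ++i)          // trip count known at compile time
        acc += data[i] * static_cast<float>(i + 1);
    if (N == 0)                          // resolved at compile time
        acc = -1.0f;
    return acc;
}

// The same computation with a runtime bound, for comparison.
float weighted_sum_runtime(const float* data, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)          // trip count only known at run time
        acc += data[i] * static_cast<float>(i + 1);
    if (n == 0)
        acc = -1.0f;
    return acc;
}
```

Both versions return the same value for the same bound; the difference is only in what the compiler is allowed to assume when generating code.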

Unfortunately, the only suggestion I have for you here is to write a short Perl/Python script that generates your long chain of if-statements in a separate file, then #include that file right into your function body at the appropriate location. That way you avoid the error-prone task of cutting-and-pasting the selection block into existence.
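A rough sketch of such a generator. The post suggests Perl or Python; it is written in C++ here only to stay in the thread's language. The function name `make_dispatch_chain` and the placeholders `kernel`, `grid`, `block` and `args` in the emitted text are assumptions for illustration, not names from anyone's actual code.

```cpp
#include <sstream>
#include <string>

// Builds the text of an if / else-if chain that maps the runtime values
// a and b onto explicit template arguments, one line per combination.
// "kernel", "grid", "block" and "args" are placeholders in the output.
std::string make_dispatch_chain(int max_a, int max_b)
{
    std::ostringstream out;
    bool first = true;
    for (int a = 0; a < max_a; ++a) {
        for (int b = 0; b < max_b; ++b) {
            out << (first ? "if" : "else if")
                << " (a == " << a << " && b == " << b << ") "
                << "kernel<" << a << ", " << b << "><<<grid, block>>>(args);\n";
            first = false;
        }
    }
    return out.str();
}
```

The returned string would be written to a file once at build time, and that file included inside the host function that launches the kernel.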

AndreiB · January 18, 2009, 10:22pm

Here’s an idea which works with the Driver API.

Create a templated kernel, then explicitly instantiate it over the desired ranges from a dummy function, i.e.

void dummy() {
  for( int a1 = 0; a1 < 32; a1++ )
    for( int a2 = 0; a2 < 32; a2++ )
      for( int a3 = 0; a3 < 32; a3++ )
        mykernel<a1,a2,a3>(...);
}

Then on the host, create and initialize a 3-D array which will hold pointers to the instantiated functions. The trick here is to get the addresses of the compiled functions. If you look at the nvcc output or into the .cubin file, you’ll see that the values of a1, a2 and a3 are actually part of the mangled function name, so you can fill your array of function pointers for each a1, a2, a3. Now, instead of many ifs and switches, you can get the address of the required function with one table lookup.
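The table-lookup idea can be sketched on the host as follows. This is an analogue, not the Driver API route described above: it fills the table with ordinary C++ function pointers via a C++14 index-sequence expansion instead of looking up mangled names in the .cubin (which would presumably go through `cuModuleGetFunction`). The stand-in `mykernel`, the two-parameter signature and the small `kRange` are assumptions to keep the example short.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for the templated kernel; it only encodes its
// template arguments so the lookup can be checked.
template <int A1, int A2>
int mykernel(int x) { return x + 100 * A1 + A2; }

using KernelFn = int (*)(int);

constexpr int kRange = 4;   // 32 in the thread; kept small here

// Expands I = 0 .. kRange*kRange-1 into one instantiation per (a1, a2).
template <std::size_t... I>
std::array<KernelFn, sizeof...(I)> make_table(std::index_sequence<I...>)
{
    return {{ &mykernel<static_cast<int>(I / kRange),
                        static_cast<int>(I % kRange)>... }};
}

// Flattened 2-D table of function pointers, built once at start-up.
static const std::array<KernelFn, kRange * kRange> kTable =
    make_table(std::make_index_sequence<kRange * kRange>{});

// One table lookup replaces the chain of if / else-if tests.
// Callers must keep a1 and a2 inside [0, kRange).
inline int dispatch(int a1, int a2, int x)
{
    return kTable[static_cast<std::size_t>(a1 * kRange + a2)](x);
}
```

The pack expansion does at compile time what the runtime loops in dummy() try to do, which is why it is accepted where the loops are not.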

Implementing this approach with the Runtime API is likely possible, but you’ll probably need two levels of templating: the first for device functions and the second for host-to-device stubs.

Tigga · January 18, 2009, 10:37pm

My compiler doesn’t accept the above code to instantiate the template. It gives:

error: expression must have a constant value

on the mykernel<a1,a2,a3>(...) line. It doesn’t seem to be unrolling the loops at compile time, so a1, a2 and a3 are still ordinary runtime variables when they are used as template arguments. I’m using CUDA 2.0 on Linux. #pragma unroll doesn’t help at all.

I think I’ve found a way to force unrolling using recursive templates (from here: http://www.codeproject.com/KB/cpp/crc_meta.aspx). This seems to be producing the results that I’m looking for in my one-variable test case.
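For the one-variable case, a recursive-template dispatcher might look roughly like this. It is a sketch of the general technique, not the code from the linked article or from the post; `run_kernel` and `Dispatch` are made-up names, and `run_kernel` stands in for the real templated kernel launch.

```cpp
// Hypothetical stand-in for launching the templated kernel.
template <int A>
int run_kernel(int x) { return x * A; }

// Each level compares the runtime value against N and otherwise recurses
// to N - 1. Instantiating Dispatch<32> therefore instantiates
// run_kernel<32>, run_kernel<31>, ... run_kernel<0> at compile time.
template <int N>
struct Dispatch {
    static bool call(int a, int x, int* result)
    {
        if (a == N) {
            *result = run_kernel<N>(x);
            return true;
        }
        return Dispatch<N - 1>::call(a, x, result);
    }
};

// Base case: ends the recursion at 0 and reports out-of-range values.
template <>
struct Dispatch<0> {
    static bool call(int a, int x, int* result)
    {
        if (a != 0)
            return false;                // a was not in 0..N
        *result = run_kernel<0>(x);
        return true;
    }
};
```

A call such as `Dispatch<32>::call(a, x, &r)` then covers every value from 0 to 32 without a hand-written chain. Extending it to two or three variables multiplies the number of instantiations, so the compile-time concern raised in the first post still applies.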
