Passing C++ templates to CUDA: how to pass compile-time constants from C++ to CUDA

Hi all

For my Master’s I’ve been working on a generic blocked sparse matrix library for the CPU that performs really well compared to scalar sparse matrix libraries, and I would like to support offloading of operations to CUDA-capable GPUs.

My block dimensions are defined at compile time using templates, e.g. Block<M,N,float> for a dense MxN block of floats.

Now I would like to implement an SpMV operation in CUDA that effectively exploits these dimensions. But since the CUDA code lives in a separate compilation unit, the template parameters cannot be passed to the CUDA function responsible for launching the kernel; instead they must be passed as ordinary function arguments.

Because function arguments are only evaluated at run time, this causes several issues, such as defining shared array sizes and unrolling loops bounded by the block dimensions. For example, my initial design was centered around device functions used by each thread to compute one block-vector multiplication, but they perform horrendously due to unaligned memory access by warps. In code:

    // data3 += data1 * data2
    // dim(data1) = (rows1, cols1); dim(data2) = (cols1, 1)
    __device__ void
    bml_prod_bv_d(size_t const rows1,
                  size_t const cols1,
                  float const* const data1, // rows1 x cols1 block
                  float const* const data2, // vector
                  float* data3)             // result
    {
      // nice if I could "cache" data2
      // __shared__ float tmp_data2[cols1];
      size_t b = 0;
      #pragma unroll // no effect since rows1 is not known at compile time
      for (size_t i = 0; i < rows1; ++i, b += cols1)
      {
        float dot = 0.0f;
        #pragma unroll // no effect since cols1 is not known at compile time
        for (size_t j = 0; j < cols1; ++j)
        {
          dot += data1[b + j] * data2[j]; // ouch, stride rows1*cols1 when reading data1[]
        }
        data3[i] += dot;
      }
    }
Does anyone know of a way to pass compile-time constants from the C++ host compiler to nvcc, either in CUDA 2.2 or in any planned future release?

You can use templates in device code, so you could define your device function as template<int M, int N, typename T>. It should probably even be possible to use those arguments to define the shared segment size, but I haven’t tried that part.

Otherwise you could declare the shared segment without any size (extern __shared__) and dynamically allocate the right amount of shared memory at kernel invocation.

Thanks for the quick reply, theMatrix.

Yes, that is exactly what I did initially, but I cannot pass the template parameters from the C++ host code, which is the entry point for users of my library. In code:

    extern void prod(...); // defined in a *.cu file compiled by nvcc

    void foo()
    {
      typedef Block<2,3,float> Block_2x3;
      Compressed_matrix<Block_2x3> A;
      Vector x, y;
      // fill A and x
      prod(A, x, y); // call CUDA function responsible for mem allocation and kernel invocation
    }
prod() would then call CUDA code in some *.cu file that has been compiled externally by nvcc, and so it cannot access the block dimensions of Block_2x3, since extern functions cannot be function templates.

I know that a commonly used work-around is to place a switch in the extern CUDA function that decides the values of the template parameters to pass to the kernel, but in my case the number of possible combinations of block dimensions is very large (especially when it comes to sparse-matrix-sparse-matrix operations). But any ideas would be helpful.

Yep, I am planning to do that now. The downside is that it requires integer arithmetic to get the offsets right, but it seems to be the only viable way to go.

But of course the above design is probably too naive anyway, in that each thread ends up with too much work and a poor memory access pattern?

Not sure if this would work for you, but I’ve had some classmates who created a wrapper around their application launch and compilation.

The wrapper processes the data files to determine the size of their data, dynamically creates header files with the correct defines and then compiles the CUDA code with NVCC.

It’s definitely not clean, but it works. If the nvcc compilation is quick, there won’t be much noticeable delay at program launch and you have the added convenience of working with #defined data.

Just a thought ^_^

Cute ;-)

I’m afraid that hacks such as that one are perhaps the only way to do it… it’s just not clean enough to be an option for me. But thanks for the input.

It seems that I will have to find a way to make each warp compute on consecutive data by mapping each thread to one scalar instead of a whole block, and then do some kind of gather/reduction. I’ll leave this post open a little while longer, in case someone has some further insights.