CUDA built-in variables vs. constant memory vs. preprocessor macros (i.e. hard-coded values)

Hi,

The scenario: I have a kernel that makes numerous accesses to the gridDim and blockDim built-in variables (as part of many if statements). After that, a simple calculation is performed.

I noticed that if I use hard-coded values for block and grid dimensions instead of querying gridDim and blockDim, the execution time improves substantially (especially for large grids). Copying gridDim and blockDim to constant-memory variables before launching the kernel and querying those performs worse than accessing gridDim and blockDim directly.
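For reference, here's a stripped-down sketch of the three variants I'm comparing (the kernel bodies and names are simplified stand-ins for the real code, which has many more if statements):

```
// Variant 1: query the built-in variables directly.
__global__ void kernelBuiltin(float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < gridDim.x * blockDim.x)
        out[idx] = 2.0f * idx;
}

// Variant 2: hard-coded dimensions via preprocessor macros.
#define GRID_DIM  4096
#define BLOCK_DIM 256

__global__ void kernelMacro(float *out)
{
    int idx = blockIdx.x * BLOCK_DIM + threadIdx.x;
    if (idx < GRID_DIM * BLOCK_DIM)
        out[idx] = 2.0f * idx;
}

// Variant 3: dimensions copied to constant memory before launch, e.g.
//   cudaMemcpyToSymbol(c_blockDim, &blockDimX, sizeof(int));
__constant__ int c_gridDim;
__constant__ int c_blockDim;

__global__ void kernelConst(float *out)
{
    int idx = blockIdx.x * c_blockDim + threadIdx.x;
    if (idx < c_gridDim * c_blockDim)
        out[idx] = 2.0f * idx;
}
```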

My questions:
Is querying the built-in variables the fastest way of retrieving block and grid dimensions at runtime?
In what type of memory do the built-in variables reside?

Thanks.

They reside in shared memory, and yes, shared memory is generally the fastest of all available memory spaces.

Are you sure? It doesn’t say anything about them in the programming guide, but the PTX manual (chapter 8) says that there are special built-in registers for accessing these values.
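For instance, if you compile a trivial kernel and look at the PTX (nvcc -ptx), reads of the built-ins show up as moves from the special registers, roughly like this (exact register numbering will vary):

```
__global__ void showDims(unsigned int *out)
{
    // In the generated PTX these reads appear as moves from
    // special registers, e.g.:
    //   threadIdx.x -> mov.u32 %r1, %tid.x;
    //   blockIdx.x  -> mov.u32 %r2, %ctaid.x;
    //   blockDim.x  -> mov.u32 %r3, %ntid.x;
    //   gridDim.x   -> mov.u32 %r4, %nctaid.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = gridDim.x;
}
```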

Kernel arguments are in shared memory; I'm not sure about the dimensions, but the indices are definitely in special registers.

You are probably seeing the effects of compile-time constant optimization. When you use the preprocessor to hard-code the dimensions in your code, the constants can be folded into any calculations you do with them, possibly allowing some register loads or multiplications to be optimized away.
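To make that concrete, here's a sketch (BLOCK_DIM is a made-up macro): with a literal power-of-two block size, the compiler can strength-reduce the multiply to a shift and drop the special-register read entirely, neither of which is possible when the value is only known at runtime.

```
#define BLOCK_DIM 256  // compile-time constant

__global__ void folded(int *out, int n)
{
    // blockIdx.x * 256 can compile to blockIdx.x << 8, and the
    // runtime read of blockDim.x (%ntid.x) disappears altogether.
    int idx = blockIdx.x * BLOCK_DIM + threadIdx.x;
    if (idx < n)
        out[idx] = idx;
}
```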

I thought CUDA was supposed to be doing some kind of JIT compilation now? If that is the case, I would think they would want to optimize for something like this (so you could specify the dimensions at runtime and still have them compiled directly into the code for speed).

That would require re-running the JIT compiler every time you change the dimensions of the kernel, which for some usage patterns could be a drag. Many programs only ever use one grid configuration, but some use a data-dependent block count that can vary on every single call.
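E.g. the common pattern below (made-up names) derives the block count from the input size on every call, so baking the dimensions into the compiled code would mean recompiling whenever n changes:

```
__global__ void scale(float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] *= 2.0f;
}

void launchScale(float *d_out, int n)
{
    const int threads = 256;
    int blocks = (n + threads - 1) / threads;  // data-dependent block count
    scale<<<blocks, threads>>>(d_out, n);
}
```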

I might be wrong, but I definitely read somewhere that those are in smem.

This is one of the recent discussions: http://forums.nvidia.com/index.php?showtopic=90377

Thanks for your answers!
My implicit question was actually “Is this reasonable?”, so this thread already helped me a lot.