CUDA built-in variables vs. constant memory vs. preprocessor macros (i.e. hard-coded values)

Hi,

The scenario: I have a kernel that makes numerous accesses to the gridDim and blockDim built-in variables (as part of many if statements). After that, a simple calculation is performed.

I noticed that if I use hard-coded values for block and grid dimensions instead of querying gridDim and blockDim, the execution time improves substantially (especially for large grids). Copying gridDim and blockDim to constant-memory variables before launching the kernel and querying those performs worse than accessing gridDim and blockDim directly.
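For reference, here's a stripped-down sketch of the three variants I'm comparing (the kernel bodies and names are simplified stand-ins for the real code, which has many more if statements):

```
// Variant 1: query the built-in variables directly.
__global__ void kernelBuiltin(float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < gridDim.x * blockDim.x)
        out[idx] = 2.0f * idx;
}

// Variant 2: hard-coded dimensions via preprocessor macros.
#define GRID_DIM  4096
#define BLOCK_DIM 256

__global__ void kernelMacro(float *out)
{
    int idx = blockIdx.x * BLOCK_DIM + threadIdx.x;
    if (idx < GRID_DIM * BLOCK_DIM)
        out[idx] = 2.0f * idx;
}

// Variant 3: dimensions copied to constant memory before launch, e.g.
//   cudaMemcpyToSymbol(c_blockDim, &blockDimX, sizeof(int));
__constant__ int c_gridDim;
__constant__ int c_blockDim;

__global__ void kernelConst(float *out)
{
    int idx = blockIdx.x * c_blockDim + threadIdx.x;
    if (idx < c_gridDim * c_blockDim)
        out[idx] = 2.0f * idx;
}
```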

My questions:
Is querying the built-in variables the fastest way of retrieving block and grid dimensions at runtime?
In what type of memory do the built-in variables reside?

Thanks.

They reside in shared memory, and yes, shared memory is generally the fastest of all available memory spaces.

Are you sure? It doesn’t say anything about them in the programming guide, but the PTX manual (chapter 8) says that there are special built-in registers for accessing these values.
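For instance, if you compile a trivial kernel and look at the PTX (nvcc -ptx), reads of the built-ins show up as moves from the special registers, roughly like this (exact register numbering will vary):

```
__global__ void showDims(unsigned int *out)
{
    // In the generated PTX these reads appear as moves from
    // special registers, e.g.:
    //   threadIdx.x -> mov.u32 %r1, %tid.x;
    //   blockIdx.x  -> mov.u32 %r2, %ctaid.x;
    //   blockDim.x  -> mov.u32 %r3, %ntid.x;
    //   gridDim.x   -> mov.u32 %r4, %nctaid.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = gridDim.x;
}
```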

Kernel arguments are in shared memory; I'm not sure about the dimensions, but the indices are definitely in special registers.

You are probably seeing the effects of compile-time constant optimization. When you use the preprocessor to hard-code the dimensions in your code, the constants can be folded into any calculations you do with them, possibly allowing some register loads or multiplications to be optimized away.
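To make that concrete, here's a sketch (BLOCK_DIM is a made-up macro): with a literal power-of-two block size, the compiler can strength-reduce the multiply to a shift and drop the special-register read entirely, neither of which is possible when the value is only known at runtime.

```
#define BLOCK_DIM 256  // compile-time constant

__global__ void folded(int *out, int n)
{
    // blockIdx.x * 256 can compile to blockIdx.x << 8, and the
    // runtime read of blockDim.x (%ntid.x) disappears altogether.
    int idx = blockIdx.x * BLOCK_DIM + threadIdx.x;
    if (idx < n)
        out[idx] = idx;
}
```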

I thought CUDA was supposed to be doing some kind of JIT compilation now? If that is the case, I would think they would want to optimize for something like this (so you could specify the dimensions at runtime and still have them compiled directly into the code for speed).

That would require re-running the JIT compiler every time you change the dimensions of the kernel, which for some usage patterns could be a drag. Many programs only ever use one grid configuration, but some use a data-dependent block count that can vary on every single call.
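E.g. the common pattern below (made-up names) derives the block count from the input size on every call, so baking the dimensions into the compiled code would mean recompiling whenever n changes:

```
__global__ void scale(float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] *= 2.0f;
}

void launchScale(float *d_out, int n)
{
    const int threads = 256;
    int blocks = (n + threads - 1) / threads;  // data-dependent block count
    scale<<<blocks, threads>>>(d_out, n);
}
```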

I might be wrong, but I definitely read somewhere that those are in smem.

This is one of the recent discussions: http://forums.nvidia.com/index.php?showtopic=90377

Thanks for your answers!
My implicit question was actually “Is this reasonable?”, so this thread already helped me a lot.