The scenario: I have a kernel that makes numerous accesses to the gridDim and blockDim built-in variables (as part of many if statements). After that, a simple calculation is performed.
I noticed that if I use hard-coded values for block and grid dimensions instead of querying gridDim and blockDim, the execution time improves substantially (especially for large grids). Copying gridDim and blockDim to constant-memory variables before launching the kernel and querying those performs worse than accessing gridDim and blockDim directly.
Is querying the built-in variables the fastest way of retrieving block and grid dimensions at runtime?
In what type of memory do the built-in variables reside?
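For context, a minimal sketch of the constant-memory variant described above (the symbol names and host-side setup are illustrative, not the original code):

```cuda
#include <cuda_runtime.h>

// Illustrative: grid/block dimensions mirrored into constant memory
__constant__ dim3 c_gridDim;
__constant__ dim3 c_blockDim;

// Host side, before the kernel launch:
void setup(dim3 grid, dim3 block)
{
    cudaMemcpyToSymbol(c_gridDim,  &grid,  sizeof(dim3));
    cudaMemcpyToSymbol(c_blockDim, &block, sizeof(dim3));
}

// Inside the kernel, c_blockDim.x / c_gridDim.x would then be read
// instead of the built-in blockDim.x / gridDim.x.
```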
You are probably seeing the effects of compile-time constant optimization. When you use the preprocessor to specify the dimensions in your code, the constants can be folded into any calculations that use them, possibly allowing some register loads or multiplications to be optimized away.
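To make the folding concrete, here is a sketch (illustrative names and sizes, not the original kernel) of the same bounds check written both ways:

```cuda
// Runtime variant: blockDim.x and gridDim.x must be read at launch time,
// and the product is computed with an actual multiply.
__global__ void kernel_builtin(float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < blockDim.x * gridDim.x)   // runtime reads + multiply
        out[idx] = idx * 2.0f;
}

// Compile-time variant: the dimensions are preprocessor constants, so
// BLOCK_X * GRID_X folds to a single immediate in the generated code.
#define BLOCK_X 256
#define GRID_X  1024

__global__ void kernel_hardcoded(float *out)
{
    int idx = blockIdx.x * BLOCK_X + threadIdx.x;
    if (idx < BLOCK_X * GRID_X)         // folds to: if (idx < 262144)
        out[idx] = idx * 2.0f;
}
```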
I thought CUDA's compiler was supposed to be doing some kind of JIT'ing now? If that is the case, I would think they would want to optimize for something like this (so you can specify the dimensions at runtime and have them compiled directly into the code for speed).
That would require re-running the JIT compiler every time you change the dimensions of the kernel, which for some usage patterns could be a drag. Many programs use only one grid configuration ever, but some use a data-dependent block count which could vary every single call.
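A middle ground that avoids re-JIT'ing is the standard C++ template trick: instantiate the kernel for a handful of common block sizes at build time and dispatch at runtime. Inside each instantiation the size is a compile-time constant and folds just like a hard-coded value. A sketch with illustrative names:

```cuda
// The block size is a template parameter, so within each instantiation
// BLOCK_X behaves exactly like a #define'd constant.
template <int BLOCK_X>
__global__ void kernel(float *out, int n)
{
    int idx = blockIdx.x * BLOCK_X + threadIdx.x;  // BLOCK_X is an immediate
    if (idx < n)
        out[idx] = idx * 2.0f;
}

// Host-side dispatch over the sizes compiled in; the caller still
// chooses the configuration at runtime, with no JIT recompilation.
void launch(float *out, int n, int blockSize, int gridSize)
{
    switch (blockSize) {
    case 128: kernel<128><<<gridSize, 128>>>(out, n); break;
    case 256: kernel<256><<<gridSize, 256>>>(out, n); break;
    case 512: kernel<512><<<gridSize, 512>>>(out, n); break;
    }
}
```

The trade-off is binary size: each instantiated size adds a copy of the kernel, so this only pays off for a small, known set of configurations.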