Namely, would #define’ing my block dimensions (say, as myBlockDim) and using myBlockDim be faster than directly using blockDim.x within a kernel? Or, if I use compile-time constants for the launch configuration of a kernel, will the compiler recognize blockDim.x as a compile-time constant? I ask because being able to use blockDim.x, etc., would make kernels a bit more flexible with respect to array-indexing functions (if I want to use the same indexing function in kernels with different block dimensions) - but it’s not worth it if it’s slower! (The most recent thread I found on this is ~8 years old.)
No, they are not.
If you inspect the SASS code, you will find that they translate into a load from a special register or a load from constant memory, neither of which is the same thing as a compile-time constant. That should also be evident from a programming perspective: the underlying CUDA quantities are adjustable at run time, so block dimensions, for example, are not necessarily compile-time constants.
However, you could use compile-time constants in place of these, and it may make things run faster (assuming, e.g., your block dimensions are known at compile time). I personally doubt this would make a significant performance difference in most CUDA programs, but I’m sure there are pathological cases which can be demonstrated, and YMMV.
I’m pretty sure the blockDim and gridDim values are loaded from constant memory. The threadIdx and blockIdx values are loaded from special registers (S2R instructions), which have about the same latency and throughput as shared memory.
Here is the mapping I use in Maxas:
my %constants = (
    blockDimX => 'c[0x0][0x8]',
    blockDimY => 'c[0x0][0xc]',
    blockDimZ => 'c[0x0][0x10]',
    gridDimX  => 'c[0x0][0x14]',
    gridDimY  => 'c[0x0][0x18]',
    gridDimZ  => 'c[0x0][0x1c]',
);
Is that true for any SM generation, or just Maxwell/Pascal?
Thanks, I updated my response text to reflect this. It makes sense: these values are constant across the grid.
If you aren’t using too many distinct sizes, consider templating the kernels in question. That provides performance in conjunction with (limited) flexibility.
Thanks for the suggestion; the kernels aren’t as much the issue as global array indexing functions.
Unless I misunderstood, the “global array indexing functions” are used inside the kernels. Is that not the case? If they are used in host code, you can use templates there.
Using template parameters in the indexing arithmetic has all the performance benefits of using compile-time constants, because each such parameter is a constant when the template is instantiated at compile time. At the same time, the code itself retains flexibility because the template parameters are used in the manner of variables; there are no hard-coded constants.
Obviously one would not want to go overboard with template instantiations, but generating several dozen instances seems reasonable to me and is something I have used before. If you need more flexibility than that I am afraid it’s back to blockDim and gridDim. Depending on the code, the effect on performance could be minor (e.g. 2%) or significant (e.g. 20%).
Right, templating would help. Within each kernel I have a true compile-time constant for the dimension, which I could pass as the template parameter within that kernel. Then, if in a different kernel I want to use the same indexing function but with different block dimension, I can just do the same… I like it. Just to make sure - can one #define a dim3 variable as a compile-time constant?
You might also want to consider using constexpr over templates in this instance.