Hi everyone,
I have a question: when I write a kernel, I sometimes use a small array, for example by declaring "int a[10];" inside the kernel, and I may access this array heavily. If the access pattern of array a is dynamic, i.e. I index into it with values that are not known at compile time, the array is stored in local memory rather than in registers. If the array is accessed very frequently, I could load it into shared memory, but I believe shared memory is still slower than registers.
Do we have other options? Is there any research work on this? What makes it difficult, in terms of hardware or software design, to keep a small array in registers? Will NVIDIA optimize this?
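To make this concrete, here is a minimal sketch of the situation I mean (the kernel and names are made up purely for illustration):

```
__global__ void scatter_gather(const int *idx, int *out, int n)
{
    // Small per-thread array. With only compile-time indexes it could
    // live in registers; the dynamic index below forces it into
    // local memory (off-chip, though cached).
    int a[10];

    for (int i = 0; i < 10; i++)        // static indexing: no problem
        a[i] = i * (int)threadIdx.x;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = a[idx[tid] % 10];    // dynamic index: a[] gets spilled
}
```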
I am not aware of any optimizations the CUDA toolchain is missing in this regard. Generally speaking, registers are not indexable at run time, and this applies to GPUs as well. With optimizations enabled, the CUDA compiler will allocate a small array to registers, if
(1) all indexes can be resolved to compile-time constants, and
(2) the array is small enough according to an internal heuristic (which may change with architecture and compiler version).
In some cases programmers can assist with (1), for example by manually directing the compiler to fully unroll a loop that the compiler's own unrolling heuristics don't consider suitable for unrolling.
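As a sketch (the kernel and names here are hypothetical, not from the question above): forcing a full unroll of the loops below turns every index into a compile-time constant, satisfying (1):

```
__global__ void poly_eval(const float *x, float *y, int n)
{
    // Small per-thread coefficient array.
    float a[8];
    #pragma unroll
    for (int i = 0; i < 8; i++)
        a[i] = (float)(i + 1);

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float acc = 0.0f;
        // The full unroll rewrites a[i] as a[0], a[1], ..., a[7]. All
        // indexes are then compile-time constants, so the compiler is
        // free to keep a[] in registers.
        #pragma unroll
        for (int i = 0; i < 8; i++)
            acc = acc * x[tid] + a[i];
        y[tid] = acc;
    }
}
```

Whether a[] actually ends up in registers still depends on the heuristic in (2); inspecting the generated code with cuobjdump/nvdisasm, or checking the spill and stack-frame statistics reported by -Xptxas -v, will tell you.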
If access to the small array is mostly uniform across the threads of a warp and the data is read-only, consider an array in constant memory. Empirically, "mostly uniform" can be taken to mean that no more than three different addresses are presented across a warp.
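A minimal sketch of the read-only case (names are hypothetical; the table contents are assumed to be known on the host before the kernel launches):

```
// Read-only lookup table in constant memory. An access that is
// uniform across a warp is broadcast; distinct addresses within a
// warp are serialized, which is why mostly-uniform access matters.
__constant__ float table[10];

__global__ void lookup(const int *idx, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = table[idx[tid] % 10];
}
```

The host fills the table once before launching, e.g. with cudaMemcpyToSymbol(table, host_table, sizeof(float) * 10).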
For a small array that is dynamically indexed and accessed in read/write fashion, shared memory would be a suitable choice. Pay attention to bank conflicts, and check for them with the CUDA profiler. In general, let the profiler guide optimization efforts: the focus of this question may well turn out to be a micro-optimization, or an instance of the XY problem.
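For completeness, a minimal sketch of the read/write case (block size and names are hypothetical). Laying the shared array out as [element][thread] means thread t touches address i * THREADS_PER_BLOCK + t for element i, so the 32 threads of a warp always fall into 32 distinct banks and even dynamic indexing stays free of bank conflicts:

```
#define THREADS_PER_BLOCK 128   // must be a multiple of the warp size (32)
#define ELEMS 10

__global__ void dyn_scratch(const int *idx, int *out, int n)
{
    // One private ELEMS-element array per thread, kept in shared
    // memory instead of local memory.
    __shared__ int scratch[ELEMS][THREADS_PER_BLOCK];

    for (int i = 0; i < ELEMS; i++)
        scratch[i][threadIdx.x] = i * (int)threadIdx.x;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = scratch[idx[tid] % ELEMS][threadIdx.x];  // conflict-free dynamic read
}
```

The trade-off is occupancy: this uses ELEMS * THREADS_PER_BLOCK * sizeof(int) bytes of shared memory per block (5 KB here), which limits how many blocks can be resident per SM.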