I have a question: when I write a kernel, I often use a small per-thread array. For example, I declare "int a[N];" in the kernel, where N is a small compile-time constant, and I access it heavily. However, if the access pattern is dynamic, meaning I index a with a value the compiler cannot resolve at compile time, the array is placed in local memory rather than in registers. If a is accessed very frequently, I can move it into shared memory instead, but I believe shared memory is still slower than registers.
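To make the situation concrete, here is a minimal sketch of the cases I mean. The array size N = 8, the block size, and all kernel and function names are assumptions made up for illustration, not from real code. It shows the dynamically indexed array, a workaround that uses only compile-time indices via an unrolled select, and the shared-memory variant with one column per thread.

```cuda
#include <cuda_runtime.h>

constexpr int N = 8;          // hypothetical array size (assumed for this sketch)
constexpr int THREADS = 128;  // hypothetical block size (assumed for this sketch)

// Variant 1: the index is known only at run time, so the compiler will
// typically place `a` in local memory (registers cannot be indexed).
__global__ void dynamic_index(const int* idx, int* out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int a[N];
    for (int i = 0; i < N; ++i) a[i] = tid + i;
    out[tid] = a[idx[tid]];
}

// Variant 2: every array access uses a compile-time index after unrolling,
// so `a` can stay in registers, at the cost of N compares per lookup.
__global__ void unrolled_select(const int* idx, int* out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int a[N];
#pragma unroll
    for (int i = 0; i < N; ++i) a[i] = tid + i;
    int k = idx[tid];
    int v = 0;
#pragma unroll
    for (int i = 0; i < N; ++i) {
        if (i == k) v = a[i];
    }
    out[tid] = v;
}

// Variant 3: each thread owns one column of a shared array. Element i of
// thread t lives at s[i * THREADS + t], so the threads of a warp hit
// different banks. No __syncthreads() is needed because a thread only
// touches its own column.
__global__ void shared_column(const int* idx, int* out) {
    __shared__ int s[N * THREADS];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < N; ++i) s[i * THREADS + threadIdx.x] = tid + i;
    out[tid] = s[idx[tid] * THREADS + threadIdx.x];
}

// Host helper: runs all three variants on one block and returns the number
// of wrong results (0 if they agree with the expected value), or -1 on a
// CUDA error.
int compare_variants() {
    int* idx = nullptr;
    int* out = nullptr;
    if (cudaMallocManaged(&idx, THREADS * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&out, THREADS * sizeof(int)) != cudaSuccess) {
        cudaFree(idx);
        return -1;
    }
    for (int t = 0; t < THREADS; ++t) idx[t] = (t * 5 + 3) % N;  // always in [0, N)

    int mismatches = 0;
    for (int variant = 0; variant < 3; ++variant) {
        if (variant == 0) dynamic_index<<<1, THREADS>>>(idx, out);
        if (variant == 1) unrolled_select<<<1, THREADS>>>(idx, out);
        if (variant == 2) shared_column<<<1, THREADS>>>(idx, out);
        if (cudaDeviceSynchronize() != cudaSuccess) {
            cudaFree(idx);
            cudaFree(out);
            return -1;
        }
        for (int t = 0; t < THREADS; ++t) {
            if (out[t] != t + idx[t]) ++mismatches;  // a[k] was set to tid + k
        }
    }
    cudaFree(idx);
    cudaFree(out);
    return mismatches;
}
```

Calling compare_variants() from main only checks that the three variants compute the same values; it says nothing about speed. Compiling with nvcc -Xptxas -v prints the per-kernel stack frame and spill sizes, which should show whether a given variant actually ended up using local memory on a particular architecture and compiler version.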
Are there other options? Is there any research on this? What makes it difficult, in hardware or software design, to keep a small dynamically indexed array in registers? Will NVIDIA optimize this?
Thanks a lot!!