I am noticing that the performance of my CUDA program is extremely sensitive to which type of memory is used for each stack variable. In particular, which data ends up in local memory versus registers seems to be the main factor. My question: is there any way to determine which variables have been placed in local memory, and is there a way to instruct the compiler to put certain variables into local memory and others into registers?
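For reference, here is roughly how I'm probing this today (the kernel and its names are just illustrative of my setup). Compiling with `nvcc -Xptxas -v` reports per-kernel register and local-memory ("lmem") usage, and `nvcc --ptx` dumps PTX, where locally placed variables show up as `.local` declarations accessed with `ld.local`/`st.local` — but neither maps that usage back to individual source variables, which is what I'm after.

```cuda
// repro.cu -- illustrative sketch, not my real code.
// Build with:  nvcc -Xptxas -v repro.cu   (reports registers and lmem per kernel)
// or:          nvcc --ptx repro.cu        (look for .local / ld.local / st.local)

__global__ void repro(float *out)
{
    // A dynamically indexed array typically forces local memory,
    // since registers are not indexable.
    float scratch[8];
    for (int i = 0; i < 8; ++i)
        scratch[i] = i * 0.5f;
    out[threadIdx.x] = scratch[threadIdx.x % 8];
}
```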
My guess is that there is currently no way, so my feature requests would be:
- a way to determine the type of memory assigned to each variable in a device function, e.g. in annotated output from nvcc
- a way to specify whether a stack variable should be placed in local memory, a register, or shared memory (this could be a hint, like inline)
- a way to assign local stack variables to shared memory without explicitly allocating a slot for every thread in the block, i.e. I'd like to write "int a" rather than "int a[threadBlockSize]". Maybe this could also automatically arrange the data in memory for coalescing.
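To illustrate the third request, this is roughly the boilerplate I have to write today versus what I'd like (threadBlockSize and the kernel name are just placeholders):

```cuda
#define threadBlockSize 256  // placeholder for my actual block size

__global__ void withSharedScratch(int *out)
{
    // Today: explicitly allocate one shared slot per thread, then
    // index by threadIdx.x everywhere the "variable" is used.
    __shared__ int a[threadBlockSize];
    a[threadIdx.x] = threadIdx.x * 2;
    out[threadIdx.x] = a[threadIdx.x];

    // What I'd like: declare a per-thread "int a" that the compiler
    // itself places in shared memory, indexes by thread, and lays
    // out for coalesced access.
}
```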
I'd also like to confirm a theory about how local memory is assigned. My tests suggest that almost anything I put into a structure ends up in local memory rather than in registers. It would seem useful for structured data to be placed in registers as well, when the structure is small enough.
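A minimal version of the test that led me to this theory (the struct and kernel are illustrative; results may well differ across compiler versions):

```cuda
// Small struct: three floats, accessed only through named members.
struct Particle {
    float x, y, z;
};

__global__ void structTest(float *out)
{
    // In my tests, the members of p tend to be reported as local
    // memory, whereas declaring three separate float scalars with
    // the same arithmetic keeps everything in registers.
    Particle p;
    p.x = threadIdx.x * 1.0f;
    p.y = p.x + 1.0f;
    p.z = p.x + p.y;
    out[threadIdx.x] = p.x + p.y + p.z;
}
```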