In my kernel, a few threads per warp need many more variables than the other threads. These variables won't fit in shared memory and need to persist for the entire lifetime of the thread.
Is there a way to define variables for only a subset of my threads? I need to avoid spilling into local memory by distributing the few remaining registers unevenly between threads.
It is difficult to give specific advice when so little is known about a use case. Have you considered storing the data in a texture, so it can benefit from the texture cache? If your GPU platform is sufficiently recent, you could also simply store the data in global memory and load it with __ldg().
I was under the impression that __ldg() was read-only in some way? I am not very familiar with that technique, nor with surfaces, but these variables need to be read and written frequently, so I would prefer to keep them in registers (shared memory is full).
While I welcome broader discussion and suggestions about CUDA programming techniques, I am really just wondering if the task (use case) that I outlined in my first post is possible:
Can I unevenly distribute registers between threads, so that a subset of threads have access to additional persistent variables without using the same amount of resources for every other thread?
All threads of a kernel use the same number of registers. This is unavoidable as they all execute the same code.
However, there are a few things you can do: with the shuffle intrinsics you can access registers of other threads in the same warp, and with pointers or array indices into shared memory you can assign arrays of a different size to each thread.
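A minimal sketch of the shuffle idea, assuming a recent CUDA toolkit (the __shfl_sync() variant requires CUDA 9 or later): each lane holds one value in a register, and any lane can read another lane's value with a warp shuffle instead of storing all 32 values itself. The kernel name and the choice of source lane are illustrative only.

```cuda
// Sketch: lane 0 "borrows" a register from another lane in its warp.
// Each lane keeps one value in a register; lane 0 reads lane 5's copy
// via a shuffle rather than allocating 32 registers of its own.
__global__ void borrow_registers(int *out)
{
    const unsigned full_mask = 0xffffffffu;
    int lane = threadIdx.x & 31;

    int stored = lane * 10;        // one register's worth of data per lane

    // All lanes named in the mask must execute the shuffle, but only
    // lane 0 consumes the result here.
    int fetched = __shfl_sync(full_mask, stored, 5);

    if (lane == 0)
        out[blockIdx.x] = fetched;
}
```

The same pattern generalizes: a few "heavy" lanes can treat the registers of their warp-mates as a small distributed scratchpad, at the cost of keeping the whole warp converged around the shuffle.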
At kernel start, the exact same number of registers is allocated to each thread. That is a function of how the GPU hardware implements the SIMT execution model. So you cannot dynamically allocate registers on a per-thread basis.
As best I can see, this leaves two possibilities for addressing the storage choices:
(1) Check whether the algorithm design can be modified so as not to require different amounts of fast storage per thread.
(2) After registers, the next slower physical memory is shared memory. You state that there is insufficient space left in shared memory. If there is no simple compression scheme available to make the data fit into shared memory, the next slower physical memory is global memory.
When using global memory, some accelerations are possible by utilizing caches. If the amount of data is fairly small, and access is mostly uniform, consider using constant memory (this is a mapped portion of global memory). If the data is large but read-only, check whether __ldg() is applicable. This gives access to the benefits of the texture load path (including texture/read-only cache) without having to set up textures.
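A minimal sketch of the __ldg() route, assuming a GPU of compute capability 3.5 or later (where the read-only data cache is available); the kernel and its parameters are made up for illustration:

```cuda
// Sketch: loading read-only global data through the texture/read-only
// cache with __ldg(), without setting up a texture object.
__global__ void scale(float *out, const float * __restrict__ in,
                      float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * __ldg(&in[i]);  // load via the read-only data cache
}
```

Note that the __restrict__ qualifier on the input pointer helps the compiler confirm the data really is read-only for the duration of the kernel; on recent toolchains, marking pointers const __restrict__ often lets the compiler generate these cached loads automatically.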
csp256: what is the minimum number of bytes required to store the variables in question? this should be a good check of whether your goal is achievable, regardless of method. the space in registers is only so much bigger than shared memory.
but the real point: if it becomes clear that global memory is the only option left, would (some/limited) spilling really be so bad? i do not see how it could be more expensive than global reads at that point.
“At kernel start, the exact same number of registers is allocated to each thread”
in the case of a kernel containing multiple sections, with each section utilizing a different number of registers/local memory, how is the number of registers per thread calculated for allocation - is the register count of the section with the highest register usage used?
(then there is also the case of a kernel containing functions, each with different register utilization)