Declaring variables for some threads, but not others

csp256 · November 23, 2014, 3:38am

In my kernel, a few threads per warp need many more variables than the other threads. These variables won’t fit in shared memory and need to persist for the entire lifetime of the thread.

Is there a way to define variables for only a subset of my threads? I need to avoid spilling into local memory by distributing the few remaining registers unevenly between threads.

njuffa · November 23, 2014, 6:37am

It is difficult to give specific advice when so little is known about the use case. Have you considered storing the data in a texture, so it can benefit from the texture cache? If your GPU is sufficiently recent (compute capability 3.5 or higher), you could also simply store the data in global memory and load it with __ldg().

csp256 · November 23, 2014, 5:10pm

I was under the impression that __ldg() is read-only in some way? I am not very familiar with that technique or with surfaces, but these variables need to be read and written frequently, so I would prefer to keep them in registers (shared memory is full).

While I welcome broader discussion and suggestions about CUDA programming techniques, I am really just wondering if the task (use case) that I outlined in my first post is possible:

Can I unevenly distribute registers between threads, so that a subset of threads have access to additional persistent variables without using the same amount of resources for every other thread?

tera · November 23, 2014, 5:28pm

All threads of a kernel use the same number of registers. This is unavoidable as they all execute the same code.

However there are a few things you can do: using the shuffle intrinsics you can use registers of other threads in the warp. Using pointers or array indices in shared memory you can assign arrays of different size to each thread.
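A minimal sketch of the shuffle idea, under some assumptions: the kernel name `borrowRegisters`, the host helper `runBorrowRegisters`, and the choice of lane 0 as the "greedy" thread are all hypothetical, and the code uses `__shfl_sync()` (CUDA 9 and later) where toolkits from the time of this thread would have used `__shfl()`. Every lane keeps one value in a register on behalf of lane 0, which then reads all 32 of them without touching shared or local memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each lane parks one value in its own register on behalf of lane 0.
// Lane 0 later fetches any of the 32 parked values with a warp shuffle.
__global__ void borrowRegisters(const float* in, float* out)
{
    const unsigned fullMask = 0xffffffffu;   // all 32 lanes participate
    int lane = threadIdx.x % 32;

    // One register per lane; together the warp holds 32 values.
    float parked = in[threadIdx.x];

    float sum = 0.0f;
    for (int src = 0; src < 32; ++src) {
        // Every lane must execute the shuffle; each receives lane src's value.
        float v = __shfl_sync(fullMask, parked, src);
        if (lane == 0) sum += v;             // only lane 0 actually uses it
    }
    if (lane == 0) out[threadIdx.x / 32] = sum;
}

// Hypothetical host helper: launches one warp and returns lane 0's result.
float runBorrowRegisters()
{
    float hIn[32], hOut = 0.0f;
    for (int i = 0; i < 32; ++i) hIn[i] = (float)i;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    borrowRegisters<<<1, 32>>>(dIn, dOut);
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}

int main()
{
    printf("sum seen by lane 0: %f\n", runBorrowRegisters());
    return 0;
}
```

With inputs 0 through 31, lane 0 should end up with their sum, 496. Writes would go the other way: lane 0 broadcasts the new value with a shuffle and only the owning lane stores it. The cost is that the whole warp has to take part in every access, so this only pays off when the other lanes are otherwise idle at that point.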

njuffa · November 23, 2014, 5:28pm

At kernel start, the exact same number of registers is allocated to each thread. That is a function of how the GPU hardware implements the SIMT execution model. So you cannot dynamically allocate registers on a per-thread basis.
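One way to see this for a concrete kernel is to ask the runtime what the compiler settled on. The sketch below is only illustrative: `unevenKernel` and `registersPerThread` are made-up names, and the branch taken by one lane per warp stands in for the "few threads need more variables" situation. `cudaFuncGetAttributes()` reports a single `numRegs` value for the whole kernel, and `localSizeBytes` shows how much local memory each thread uses (non-zero can indicate spilling).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Only one lane per warp takes the branch with the extra temporaries,
// yet the register count below is a property of the kernel as a whole.
__global__ void unevenKernel(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = data[i];
    if (threadIdx.x % 32 == 0) {
        float a = x * 1.5f, b = x + 2.0f, c = x - 3.0f;
        x = a * b + c;
    }
    data[i] = x;
}

// Hypothetical helper: returns the per-thread register count, or -1 on error.
int registersPerThread()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, unevenKernel) != cudaSuccess) return -1;
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return attr.numRegs;
}

int main()
{
    printf("registers per thread: %d\n", registersPerThread());
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` prints similar information (register count and spill loads/stores) at build time, without running anything.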

As best I can see, this leaves two ways to address the storage question:

(1) Check whether the algorithm design can be modified so as not to require different amounts of fast storage per thread.

(2) After registers, the next slower physical memory is shared memory. You state that there is insufficient space left in shared memory. If there isn’t a simple compression available to make the data fit into shared memory, the next slower physical memory is global memory.

When using global memory, some accelerations are possible by utilizing caches. If the amount of data is fairly small, and access is mostly uniform, consider using constant memory (this is a mapped portion of global memory). If the data is large but read-only, check whether __ldg() is applicable. This gives access to the benefits of the texture load path (including texture/read-only cache) without having to set up textures.
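For the read-only case, a rough sketch of what a `__ldg()` load looks like follows. The names `scaleWithLdg` and `ldgExample`, the table contents, and the launch configuration are all assumptions for illustration; the code needs compute capability 3.5 or higher. Since `__ldg()` only loads, it does not help with data that the kernel also writes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reads a table through the read-only data cache and writes a scaled copy.
// The table must not be modified while the kernel is running.
__global__ void scaleWithLdg(const float* __restrict__ table, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * __ldg(&table[i]);
}

// Hypothetical host helper: returns true if every output element is correct.
bool ldgExample()
{
    const int n = 256;
    float hTable[n], hOut[n];
    for (int i = 0; i < n; ++i) hTable[i] = (float)i;

    float *dTable = nullptr, *dOut = nullptr;
    cudaMalloc(&dTable, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dTable, hTable, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleWithLdg<<<(n + 127) / 128, 128>>>(dTable, dOut, n);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dTable);
    cudaFree(dOut);

    for (int i = 0; i < n; ++i)
        if (hOut[i] != 2.0f * (float)i) return false;
    return true;
}

int main()
{
    printf("ldg example %s\n", ldgExample() ? "passed" : "failed");
    return 0;
}
```

Marking the pointer `const ... __restrict__` gives the compiler the chance to choose the read-only path on its own; the explicit `__ldg()` call makes the intent unambiguous.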

little_jimmy · November 24, 2014, 6:52am

csp256: What is the minimum number of bytes required to store the variables in question? That should be a good check of whether your goal is achievable at all, regardless of method. The register file is only so much bigger than shared memory.

But the real point is this: if it becomes clear that global memory is the only option left, would some limited spilling really be so bad? I do not see how it could be more expensive than global memory reads.

njuffa:

“At kernel start, the exact same number of registers is allocated to each thread”

In the case of a kernel containing multiple sections, each using a different number of registers and a different amount of local memory, how is the number of registers per thread calculated for allocation? Is the register count of the section with the highest register usage used?

(Then there is also the case of a kernel that calls functions, each with different register usage.)
