Thread Local Memory

I have an algorithm that relies on three kinds of memory: global memory, per-block shared memory, and per-thread local memory. In evaluating what kind of GPU to buy, I need to know how much memory is available for per-thread local memory, as well as for shared memory and global memory.

Does anyone know where to find this information? I’ve searched through the documentation and am not finding it broken down into per-block memory and per-thread local memory.

Right now I have a GeForce GTX 960. I am looking at what model GPU to buy on other systems. Also, I am trying to figure out whether an array can be kept in thread local memory or if it has to be in per-block shared memory, subscripted by the thread id.

By adjusting the granularity, I have some flexibility about the size of these arrays: the larger the better, but I can make them a little smaller if the performance trade-off is worth it.


Info about memory sizes is available in the CUDA manual and on Wikipedia: see second table at

You need to read the CUDA manual to get an idea of how this works. For example, the GeForce GTX 960, like any other sm 5.2 (2nd-gen Maxwell) device, has 96 KB of shared memory and 256 KB of register memory per SM. Each SM can run up to 2048 threads in up to 32 thread blocks. The more threads you run on each SM, the fewer registers each thread gets (at 100% occupancy each thread is down to 32 registers; each register is 32 bits). The more thread blocks you run on each SM, the less shared memory each block gets (the max value per block, afair, is 48 KB for CUDA and may be 32 KB for other APIs).
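If you would rather read these limits off the card than out of a table, the CUDA runtime reports them through `cudaGetDeviceProperties`. The following is a minimal sketch, assuming the CUDA toolkit is installed and that device 0 is the GPU of interest; the helper name `printMemoryBudget` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the memory-related limits of one device.
// Returns false if the runtime cannot query the device.
bool printMemoryBudget(int device) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                     cudaGetErrorString(err));
        return false;
    }

    std::printf("%s (sm %d.%d), %d SMs\n",
                prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    std::printf("global memory:           %zu MB\n", prop.totalGlobalMem >> 20);
    std::printf("L2 cache (whole GPU):    %d KB\n", prop.l2CacheSize >> 10);
    std::printf("shared memory per SM:    %zu KB\n",
                prop.sharedMemPerMultiprocessor >> 10);
    std::printf("shared memory per block: %zu KB\n", prop.sharedMemPerBlock >> 10);
    std::printf("32-bit registers per SM: %d\n", prop.regsPerMultiprocessor);
    std::printf("registers per block:     %d\n", prop.regsPerBlock);
    std::printf("max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    return true;
}

int main() {
    // Assumption: the first CUDA device is the one being evaluated.
    return printMemoryBudget(0) ? 0 : 1;
}
```

Build it with `nvcc` and run it on each candidate machine. Note that the L2 figure is for the whole GPU, so divide by the SM count to compare it with the per-SM numbers.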

Finally, thread-local memory, aside from registers, is just part of global memory, so you can have lots of it, but by default it is cached only in L2 (and anyway L1 cache is only 10-20 KB per SM). And even if you accept the slower L2 cache for your local data, there is only 128 KB of L2 cache per SM, i.e. about the same amount as the fast shared memory.
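On the question of keeping an array per thread versus keeping it in shared memory subscripted by the thread ID, here is a minimal sketch of both variants. The sizes (`kItems`, `kThreads`, `kBlocks`) and the kernel names are made up for illustration, and where the compiler actually places the per-thread array is its own decision: a dynamically indexed array typically ends up in local memory, while one indexed only by compile-time constants may stay in registers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kItems   = 8;    // hypothetical per-thread array length
constexpr int kThreads = 128;  // hypothetical block size
constexpr int kBlocks  = 4;    // hypothetical grid size

// Variant A: a private array per thread. Because it is indexed with a
// runtime value, the compiler will typically place it in local memory.
__global__ void localArrayKernel(const int* idx, float* out) {
    float scratch[kItems];
    for (int i = 0; i < kItems; ++i) scratch[i] = float(i) * threadIdx.x;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = scratch[idx[t] % kItems];
}

// Variant B: the same data in shared memory, subscripted by thread ID.
// The [item][thread] layout puts neighbouring threads in neighbouring banks.
__global__ void sharedArrayKernel(const int* idx, float* out) {
    __shared__ float scratch[kItems][kThreads];  // 8 * 128 * 4 B = 4 KB per block
    for (int i = 0; i < kItems; ++i)
        scratch[i][threadIdx.x] = float(i) * threadIdx.x;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread reads only its own column, so no __syncthreads() is needed.
    out[t] = scratch[idx[t] % kItems][threadIdx.x];
}

// Runs both variants and returns how many outputs differ (-1 on a CUDA error).
int countMismatches() {
    constexpr int n = kThreads * kBlocks;
    int hIdx[n];
    float hLocal[n], hShared[n];
    for (int t = 0; t < n; ++t) hIdx[t] = (t * 7) % kItems;

    int* dIdx = nullptr;
    float* dLocal = nullptr;
    float* dShared = nullptr;
    if (cudaMalloc(&dIdx, sizeof(hIdx)) != cudaSuccess ||
        cudaMalloc(&dLocal, sizeof(hLocal)) != cudaSuccess ||
        cudaMalloc(&dShared, sizeof(hShared)) != cudaSuccess) return -1;
    cudaMemcpy(dIdx, hIdx, sizeof(hIdx), cudaMemcpyHostToDevice);

    localArrayKernel<<<kBlocks, kThreads>>>(dIdx, dLocal);
    sharedArrayKernel<<<kBlocks, kThreads>>>(dIdx, dShared);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    cudaMemcpy(hLocal, dLocal, sizeof(hLocal), cudaMemcpyDeviceToHost);
    cudaMemcpy(hShared, dShared, sizeof(hShared), cudaMemcpyDeviceToHost);
    cudaFree(dIdx); cudaFree(dLocal); cudaFree(dShared);

    int mismatches = 0;
    for (int t = 0; t < n; ++t)
        if (hLocal[t] != hShared[t]) ++mismatches;
    return mismatches;
}

int main() {
    int m = countMismatches();
    std::printf("mismatches: %d\n", m);
    return m == 0 ? 0 : 1;
}
```

Both kernels compute the same values; the difference is where the array lives. Compiling with `nvcc -Xptxas -v` prints the register, shared memory, and local memory usage per kernel, which shows what the compiler actually did with each array.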

At full occupancy, 96 KB across 2048 threads means 48 bytes per thread, i.e. 12 32-bit values, even fewer than the number of registers. You can reduce occupancy, but that quickly leaves the ALUs underutilized (there are 128 ALUs per SM, so each ALU handles 16 threads at 100% occupancy. The minimum latency of ALU operations is 6 cycles, so you need at least 6 threads per ALU just to cover ALU latency, and even more to cover more complex operations plus memory latency).
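The arithmetic above can be tabulated for lower occupancies too. This is a small host-only sketch using the sm 5.2 figures quoted in this thread; the constant and function names are made up for illustration, and the cap of 255 registers per thread is the hardware limit for this architecture. For another GPU, the constants would need to be replaced with that device's values.

```cuda
#include <cstdio>

// Per-SM figures for sm 5.2 as quoted in this thread (assumptions elsewhere).
constexpr int kSharedBytesPerSM  = 96 * 1024;
constexpr int kRegistersPerSM    = 64 * 1024;  // 256 KB of 32-bit registers
constexpr int kMaxThreadsPerSM   = 2048;
constexpr int kAlusPerSM         = 128;
constexpr int kMaxRegsPerThread  = 255;        // hardware cap per thread

// Shared memory available to each thread if it is split evenly.
constexpr int sharedBytesPerThread(int residentThreads) {
    return kSharedBytesPerSM / residentThreads;
}

// Registers available to each thread, limited by the per-thread cap.
constexpr int registersPerThread(int residentThreads) {
    return kRegistersPerSM / residentThreads < kMaxRegsPerThread
               ? kRegistersPerSM / residentThreads
               : kMaxRegsPerThread;
}

// Resident threads per ALU, which is what hides instruction latency.
constexpr int threadsPerAlu(int residentThreads) {
    return residentThreads / kAlusPerSM;
}

int main() {
    std::printf("threads/SM  shared B/thread  regs/thread  threads/ALU\n");
    for (int threads = kMaxThreadsPerSM; threads >= 256; threads /= 2) {
        std::printf("%10d  %15d  %11d  %11d\n", threads,
                    sharedBytesPerThread(threads),
                    registersPerThread(threads),
                    threadsPerAlu(threads));
    }
    return 0;
}
```

The first row reproduces the numbers in the post (48 bytes of shared memory, 32 registers, 16 threads per ALU). Each halving of occupancy doubles the per-thread budget but also halves the number of threads available to cover latency, which is the trade-off described above.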

Summing up: except for the simplest algorithms like summing arrays, GPU algorithms usually have the threads in a thread block work cooperatively on ~10-48 KB of data at any moment; that’s the only way to get reasonable efficiency. If you can’t structure your problem this way, you probably can’t get much out of the GPU.