Thread Local Memory

erixoltan · January 26, 2016, 1:17pm

I have an algorithm that relies on three kinds of memory: global memory, per-block shared memory and per-thread local memory. In evaluating what kind of GPU to buy, I need to know how much memory is available for per-thread local memory, as well as for shared memory and global memory.

Does anyone know where to find this information? I’ve searched through the documentation and am not finding it broken down into per-block memory and per-thread local memory.

Right now I have a GeForce GTX 960. I am looking at what model GPU to buy on other systems. Also, I am trying to figure out whether an array can be kept in thread local memory or if it has to be in per-block shared memory, subscripted by the thread id.

By adjusting granularity, I have some flexibility in the size of these arrays. The larger the better, but I can make them a little smaller if the performance trade-off is worth it.

Thanks,
Erik

BulatZiganshin · January 26, 2016, 2:48pm

Info about memory sizes is available in the CUDA manual and on Wikipedia: see the second table in the CUDA article on Wikipedia.

You need to read the CUDA manual to get an idea of how this works. For example, the GeForce GTX 960, like any other sm 5.2 (2nd-gen Maxwell) device, has 96 KB of shared memory and 256 KB of register memory per SM. Each SM can run up to 2048 threads in up to 32 thread blocks. The more threads you run on each SM, the fewer registers each thread gets: at 100% occupancy each thread gets 32 registers (each register is 32-bit), and that is the minimum. The more thread blocks you run on each SM, the less shared memory each block gets (the maximum per block, as far as I recall, is 48 KB for CUDA and may be 32 KB for other APIs).
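To make the trade-off above concrete, here is a minimal C++ sketch of the arithmetic, using only the sm 5.2 numbers quoted in this post. The struct and function names are made up for illustration, and the model is deliberately simplified: it ignores register and shared-memory allocation granularity and any per-thread register cap, so treat the results as rough upper bounds rather than what the hardware will actually allocate.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits for sm 5.2 as quoted in the post (hypothetical struct name).
struct SmLimits {
    int registers;             // 32-bit registers per SM: 256 KB / 4 bytes
    int shared_bytes;          // shared memory per SM
    int max_threads;           // resident threads per SM
    int max_blocks;            // resident thread blocks per SM
    int max_shared_per_block;  // per-block shared memory cap under CUDA
};

constexpr SmLimits kSm52{64 * 1024, 96 * 1024, 2048, 32, 48 * 1024};

// Registers available to each thread when `resident_threads` share one SM.
// Simplification: plain division, no allocation granularity.
int registers_per_thread(const SmLimits& sm, int resident_threads) {
    assert(resident_threads > 0 && resident_threads <= sm.max_threads);
    return sm.registers / resident_threads;
}

// Shared memory available to each block when `resident_blocks` share one SM,
// clamped to the per-block cap.
int shared_bytes_per_block(const SmLimits& sm, int resident_blocks) {
    assert(resident_blocks > 0 && resident_blocks <= sm.max_blocks);
    return std::min(sm.shared_bytes / resident_blocks, sm.max_shared_per_block);
}
```

For example, `registers_per_thread(kSm52, 2048)` gives the 32 registers mentioned above, halving the thread count to 1024 doubles that to 64, and `shared_bytes_per_block(kSm52, 32)` leaves only 3 KB per block.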

Finally, thread-local memory other than registers is just part of global memory, so you can have a lot of it, but by default it is cached only in the L2 cache (and in any case the L1 cache is only 10-20 KB per SM). Even if you accept using the slower L2 cache for your local data, there are only 128 KB of L2 cache per SM, i.e. about the same amount as the fast shared memory.

With full occupancy, 96 KB across 2048 threads means 48 bytes per thread, i.e. twelve 32-bit values, even fewer than the number of registers. You can reduce occupancy, but that quickly leads to underutilization of the ALUs. There are 128 ALUs per SM, so each ALU handles 16 threads at 100% occupancy. The minimum latency of ALU operations is 6 cycles, so you need at least 6 threads per ALU just to cover ALU latency, and even more to cover more complex operations plus memory latency.
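The same numbers can be worked through in a few lines of C++. This is only a sketch of the arithmetic in this post: the constants are the sm 5.2 figures quoted here, the function names are invented for illustration, and "6 threads per ALU" is treated as a hard lower bound even though real latency hiding also depends on instruction mix and memory traffic.

```cpp
#include <cassert>

// Figures quoted in the post for one sm 5.2 SM.
constexpr int kSharedBytesPerSm = 96 * 1024;
constexpr int kAlusPerSm = 128;
constexpr int kAluLatencyCycles = 6;

// Shared memory per thread if it is split evenly among resident threads.
int shared_bytes_per_thread(int resident_threads) {
    assert(resident_threads > 0);
    return kSharedBytesPerSm / resident_threads;
}

// How many resident threads each ALU has to serve.
int threads_per_alu(int resident_threads) {
    assert(resident_threads > 0);
    return resident_threads / kAlusPerSm;
}

// Fewest resident threads per SM that still give every ALU one thread
// per cycle of ALU latency (the bound stated in the post).
int min_threads_to_cover_alu_latency() {
    return kAlusPerSm * kAluLatencyCycles;
}
```

At 2048 resident threads this gives 48 bytes per thread and 16 threads per ALU. Dropping to the 768 threads returned by `min_threads_to_cover_alu_latency()` raises the budget only to 128 bytes per thread, which shows how little per-thread space is gained even at the lowest occupancy that still hides ALU latency.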

Summing up: except for the simplest algorithms, like summing arrays, GPU algorithms usually have the threads in a thread block work cooperatively on ~10-48 KB of data at any given moment. That's the only way to get reasonable efficiency. If you can't structure your problem this way, you probably can't get much out of the GPU.
