Local memory usage

gthazmatt · July 7, 2010, 12:55am

If I have a kernel with 16x16 blocks in a 32x32 grid (262,144 threads in total), and inside that kernel I declare an array

float array[256];

Assuming it’s put into local memory (which I believe it always should be), how much memory would be allocated? Would the full 1 KB be allocated for every thread? And if I wanted to preallocate global memory instead so that I could coalesce accesses, is there a better way than allocating 256 MB up front?
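As a quick check of the figures in the question, here is a minimal host-side sketch of the arithmetic. The function name `footprint_bytes` is made up for illustration, and it assumes every launched thread gets its own private copy of the array, which is the premise being asked about rather than confirmed behaviour.

```cpp
#include <cstddef>

// Total bytes needed if every launched thread owns a private copy of the array.
// Assumption: no reuse of storage between threads.
constexpr std::size_t footprint_bytes(std::size_t threads_per_block,
                                      std::size_t blocks_per_grid,
                                      std::size_t elems_per_thread,
                                      std::size_t elem_size) {
    return threads_per_block * blocks_per_grid * elems_per_thread * elem_size;
}
```

With 16x16 threads per block, 32x32 blocks, 256 elements and 4-byte floats, this gives 1 KB per thread and 256 MB overall, matching the numbers above.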

gthazmatt · July 8, 2010, 7:25am

Let me try asking another question. Is it possible for local memory to be coalesced?

tera · July 8, 2010, 7:57am

(EDIT: I’ll withdraw my reply and wait for someone from Nvidia to comment so that we might get a more authoritative answer)

Gregory_Diamos · July 8, 2010, 6:18pm

I can’t comment on how it is actually done, but there is no reason to reallocate the memory for every CUDA thread. Ideally, only enough memory for every hardware thread should be allocated. When some threads finish and others take their place, they should reuse the existing memory that was allocated for the previous thread.
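To put a rough number on that idea, the sketch below sizes the scratch area by resident (hardware) threads instead of launched threads. The function name and the device figures in the usage note are assumptions for illustration, not something stated in this thread; the real limits would have to come from the device properties.

```cpp
#include <cstddef>

// Bytes needed if storage is only reserved for threads that can be resident
// on the device at the same time, and is reused as threads retire.
// Both hardware parameters are assumptions supplied by the caller.
constexpr std::size_t resident_footprint_bytes(std::size_t num_multiprocessors,
                                               std::size_t max_threads_per_mp,
                                               std::size_t bytes_per_thread) {
    return num_multiprocessors * max_threads_per_mp * bytes_per_thread;
}
```

For example, assuming a device with 30 multiprocessors and at most 1024 resident threads on each, `resident_footprint_bytes(30, 1024, 1024)` comes to 30 MB, compared with 256 MB for one slot per launched thread.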

gthazmatt · July 9, 2010, 5:15am

I don’t see how you could do it like that, since there’s no way to index based on the hardware thread. I’ve tried using just enough memory for every thread in a block, but of course that gets overwritten when another block is scheduled on the same MP.

To give a little more context on why I’m confused by this: I did exactly what I’m talking about with one array (preallocated 256 MB, indexed by (array index)*(total number of threads) + (thread index)) and got about a 10% speedup (~35 ms to ~32 ms). However, when I did the same thing with an array of shorts (so 128 MB), it slowed down by about 30% (~55 ms). (If I preallocate just the array of shorts and not the array of floats, it takes ~58 ms.)
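For readers following along, the indexing described here can be sketched as below. This is a host-side model only; the function and parameter names are hypothetical, and in a kernel `thread_id` would be a flattened global thread index. The second function shows the per-thread contiguous layout for contrast.

```cpp
#include <cstddef>

// Interleaved layout: element `elem` of all threads is stored contiguously,
// so neighbouring threads reading the same element touch neighbouring addresses.
constexpr std::size_t interleaved_index(std::size_t elem,
                                        std::size_t thread_id,
                                        std::size_t total_threads) {
    return elem * total_threads + thread_id;
}

// Thread-major layout: each thread's elements are contiguous, so neighbouring
// threads reading the same element are `elems_per_thread` slots apart.
constexpr std::size_t thread_major_index(std::size_t elem,
                                         std::size_t thread_id,
                                         std::size_t elems_per_thread) {
    return thread_id * elems_per_thread + elem;
}
```

With the interleaved form, threads 0 and 1 reading element 3 are one slot apart; with the thread-major form and 256 elements per thread they are 256 slots apart, which is the access pattern that works against coalescing.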

tera · July 9, 2010, 8:58am

If you are up to doing something fancy, you can run your own memory allocation scheme on the GPU. If you are on a compute capability 1.3 device, you can read the multiprocessor id from the %smid register in PTX. Unfortunately, I don’t know of any way to identify different blocks on the same multiprocessor, so you would still have to run an allocation scheme between them, e.g. using atomic bit operations.
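A rough idea of what such a scheme could look like, written as a host-side C++ analogue rather than device code: one bitmask per multiprocessor, where each bit marks a scratch slot as taken. `SlotPool` and its members are hypothetical names; on the GPU the same pattern would presumably use `atomicOr` and `atomicAnd` on a word selected by the multiprocessor id, with one thread per block claiming the slot.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical slot pool: bit i set means scratch slot i is in use.
struct SlotPool {
    std::atomic<std::uint32_t> used{0};
    int num_slots;  // assumed to be between 1 and 32

    explicit SlotPool(int n) : num_slots(n) {}

    // Returns a free slot index, or -1 if every slot is currently taken.
    int acquire() {
        for (int i = 0; i < num_slots; ++i) {
            const std::uint32_t bit = std::uint32_t{1} << i;
            // fetch_or returns the previous mask; the slot is ours only if
            // the bit was clear before this update.
            if ((used.fetch_or(bit) & bit) == 0) return i;
        }
        return -1;
    }

    // Marks the slot as free again so a later block can reuse it.
    void release(int slot) {
        used.fetch_and(~(std::uint32_t{1} << slot));
    }
};
```

A block would call `acquire()` once at start, use the returned slot as its offset into the preallocated buffer, and call `release()` before exiting. What to do when `acquire()` returns -1 is left open here.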
