I am writing a program in which I need to store a lot of data temporarily for each kernel call. In other words, each thread needs temporary space which only it needs to access.
I have tried using the third argument in the execution configuration (<<<grid, threads, size>>>), but unfortunately that is limited to 16 KB per block, whereas I need closer to 96 KB. Is there any way to increase this limit?
To my knowledge CUDA requires that such arrays have a static size at compile time, but my size is only determined at runtime (i.e. int num_examples). Does this still work? Also, if I try to set the size to MAX_EXAMPLES, I get a memory overload error.
No, the variables inside global and device functions are thread-local and normally live in registers. The compiler will push variables into local memory if there are not enough registers, or if you do some kind of dynamic indexing on an array. (A short array always accessed with constant indices can be kept in registers by the compiler.) Declaring a large array inside a global or device function (e.g. "float foo[1000];") will force the compiler to allocate storage for it in local memory. A simple variable like "float x;" will almost always be put into a register.
I'll just point out that local memory allocations are physically stored in global memory, so be aware of the speed implications. There are no special local memory chips, or anything like that.
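To make the three cases concrete, here is a minimal sketch (the kernel and variable names are mine, not from the thread; the actual placement decisions are up to the compiler):

```cuda
__global__ void placement_example(float *out)
{
    float x = 1.0f;          // simple scalar: almost always a register

    float small[4];          // small array, constant indices only:
    small[0] = x;            // the compiler can keep this in registers
    small[1] = 2.0f * x;

    float big[1000];         // large array: forced into local memory,
                             // which physically lives in global memory
    for (int i = 0; i < 1000; ++i)
        big[i] = x * i;      // dynamic indexing also forces local memory

    out[threadIdx.x] = big[threadIdx.x] + small[1];
}
```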
As far as I can tell, yes the local array must have a constant size, though I’m not sure about that. There’s a standard mechanism to have a variable-sized shared memory array, but that doesn’t apply to local memory.
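For reference, the variable-sized shared memory mechanism looks like this (a minimal sketch with made-up names; note it is still capped by the per-block shared memory limit, so it won't get you to 96 KB per block):

```cuda
__global__ void kernel(int num_examples)
{
    // the size of this array is set by the third <<<>>> launch argument
    extern __shared__ float scratch[];
    if (threadIdx.x < num_examples)
        scratch[threadIdx.x] = 0.0f;   // example use
}

// host side: the size can be a runtime value
void launch(int num_examples)
{
    size_t bytes = num_examples * sizeof(float);
    kernel<<<1, num_examples, bytes>>>(num_examples);
}
```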
That was the first thing I tried, but I could never get it to work properly. The indexing seemed very straightforward to me, but it seems that the order in which the threads/blocks are executed affects what index they are assigned. Namely, I had 128 blocks, but it seemed like no block index higher than 6 was ever used.
A quick followup here: I wrote a test kernel to see whether local memory arrays are laid out in global memory in a way that ensures coalesced reads. I compared a kernel that used a local float array in a computation to a similar kernel that instead used part of a large global array as thread-local storage. The throughput was:
Local scratch array: 10.71 GB/sec
Global scratch array (coalesced): 11.69 GB/sec
Global scratch array (not coalesced): 1.35 GB/sec
The coalesced version arranged the scratch array like this:
[Thread 0, local offset 0] [Thread 1, local offset 0] … [Thread 0, local offset 1] [Thread 1, local offset 1] …
The uncoalesced version used the arrangement you might first imagine:
[Thread 0, local offset 0] [Thread 0, local offset 1] … [Thread 1, local offset 0] [Thread 1, local offset 1] …
Anyway, the point is: You can let the compiler handle local array storage if the access pattern is uniform between threads and the array size is fixed. The speed is nearly the same as if you use global memory yourself and manually coalesce the reads.
If you do need a dynamic array, then you can follow DennisR’s suggestion to use global memory. However, his access pattern won’t be coalesced and you’ll get poor performance. You should interleave entries in the pattern shown above, which makes indexing harder, but has much better throughput.
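A minimal sketch of the interleaved indexing (names are illustrative; scratch would be allocated on the host with cudaMalloc, sized nthreads * per_thread_len floats):

```cuda
__global__ void scratch_example(float *scratch, int per_thread_len)
{
    // global thread id and total thread count
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;

    // Naive (uncoalesced) layout:  scratch[tid * per_thread_len + i]
    // Interleaved (coalesced):     scratch[i * nthreads + tid]
    for (int i = 0; i < per_thread_len; ++i)
        scratch[i * nthreads + tid] = 0.0f;  // neighboring threads touch
                                             // neighboring addresses
}
```

With the interleaved formula, at each step i the warp's threads access consecutive words, which is what allows the hardware to coalesce the transactions.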