temporary memory issues

I am writing a program in which I need to store a lot of data temporarily for each kernel call. In other words, each thread needs temporary space which only it needs to access.

I have tried using the third argument in the execution configuration (<<<grid, threads, size>>>), but unfortunately that is limited to 16 KB per block, whereas I need closer to 96 KB. Is there any way to increase this limit?

Any ideas?

What you describe is the exact definition of local memory. Just declare an array in your global function:

float mydata[100];

And the compiler will put it in local memory, which is just global memory, but the compiler gives each thread its own copy.

Keep in mind that local memory is as slow as global memory (since that is where it is stored).
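To make that concrete, here is a minimal sketch of such a kernel (the kernel name, array size, and the out/n parameters are placeholders I made up for this example, not anything from the posts above):

__global__ void scratch_kernel(float *out, int n)
{
    float mydata[100];                 // per-thread array, placed in local memory

    for (int i = 0; i < 100; ++i)      // dynamic indexing keeps it out of registers
        mydata[i] = i * 0.5f;

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        out[tid] = mydata[tid % 100];
}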

What does this mean? Are the variables declared in the global function also stored in global memory? And if so, what is local memory good for? Am I wrong?

To my knowledge CUDA requires that such arrays have a static size at compile time, but my size is only determined at runtime (i.e., int num_examples). Does this still work? Also, if I try to set the size to MAX_EXAMPLES, I get a memory overload error.

No, the variables inside of global and device functions are thread-local and normally live in registers. The compiler will push variables into local memory if there are not enough registers, or if you need to do some kind of dynamic indexing on an array. (A short array always accessed with constant indices can be stored in registers by the compiler.) Declaring a large array, like “float foo[100];” inside a global or device function will force the compiler to allocate storage for this array in local memory. A simple variable like “float x;” will almost always be put into a register.
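A hedged illustration of that difference (the function and variable names are made up for this sketch):

__global__ void register_vs_local(float *out)
{
    // Almost certainly kept in registers: a scalar and a tiny array
    // accessed only with compile-time-constant indices.
    float x = 1.0f;
    float pair[2];
    pair[0] = x;
    pair[1] = x * 2.0f;

    // Forced into local memory: a large array with data-dependent indexing.
    float foo[100];
    for (int i = 0; i < 100; ++i)
        foo[i] = i + x;

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    out[tid] = pair[0] + foo[tid % 100];
}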

I only point out that local memory allocations are physically stored in global memory so that you are aware of the speed implications. There are no special local memory chips or anything like that.

Do these arrays need to have a constant size when they are declared?

As far as I can tell, yes the local array must have a constant size, though I’m not sure about that. There’s a standard mechanism to have a variable-sized shared memory array, but that doesn’t apply to local memory.
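For reference, that standard shared-memory mechanism looks roughly like this (sketch only, and note it gives per-block storage rather than the per-thread local memory asked about here):

__global__ void my_shared_kernel(float *out)
{
    // The size in bytes comes from the third launch parameter, e.g.
    //   my_shared_kernel<<<grid, threads, threadsPerBlock * sizeof(float)>>>(out);
    extern __shared__ float shared_scratch[];

    shared_scratch[threadIdx.x] = (float)threadIdx.x;   // visible to the whole block
    __syncthreads();
    out[threadIdx.x + blockIdx.x * blockDim.x] = shared_scratch[threadIdx.x];
}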

You can just allocate global memory and by indexing make sure that each thread accesses its own portion.

That was the first thing I tried, but I could never get it to work properly. The indexing seemed very straightforward to me, but it seems that the order in which the threads/blocks are executed affects what index they are assigned. Namely, I had 128 blocks, but it seemed like no block index higher than 6 was ever used.

Then you must have made an error; can you post the code?

If you need N temporary floats per thread, you could do something like the following in your kernel (completely untested, I am such a noob in C, I always have to check if I have to use & or *):

__global__ void my_kernel(float *global_array, unsigned int N)
{
    // Each thread points at its own contiguous slice of N floats
    float *scratch_mem = &global_array[N * (threadIdx.x + blockIdx.x * blockDim.x)];
    // ... kernel work using scratch_mem goes here ...
}

And just use scratch_mem[0] through scratch_mem[N-1] as your thread-local dynamic memory.
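On the host side, the matching allocation and launch might look like the following (a sketch; the block and grid sizes are placeholders, not values from the thread):

int threadsPerBlock = 128;
int numBlocks       = 128;
unsigned int N      = 100;   // floats of scratch per thread (placeholder)

float *global_array = 0;
cudaMalloc((void **)&global_array,
           (size_t)N * threadsPerBlock * numBlocks * sizeof(float));

my_kernel<<<numBlocks, threadsPerBlock>>>(global_array, N);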

A quick followup here: I wrote a test kernel to see whether local memory arrays are laid out in global memory in a way that gives coalesced reads. I compared a kernel that used a local float[100] in a computation to a similar kernel that instead used part of a large global array as thread-local storage. The throughput was:

Local scratch array: 10.71 GB/sec

Global scratch array (coalesced): 11.69 GB/sec

Global scratch array (not coalesced): 1.35 GB/sec

The coalesced version arranged the scratch array like this:

[Thread 0, local offset 0] [Thread 1, local offset 0] … [Thread 0, local offset 1] [Thread 1, local offset 1] …

The uncoalesced version used the arrangement you might first imagine:

[Thread 0, local offset 0] [Thread 0, local offset 1] … [Thread 1, local offset 0] [Thread 1, local offset 1] …

Anyway, the point is: You can let the compiler handle local array storage if the access pattern is uniform between threads and the array size is fixed. The speed is nearly the same as if you use global memory yourself and manually coalesce the reads.

If you do need a dynamic array, then you can follow DennisR’s suggestion to use global memory. However, his access pattern won’t be coalesced and you’ll get poor performance. You should interleave entries in the pattern shown above, which makes indexing harder, but has much better throughput.
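A sketch of that interleaved indexing in kernel code (the kernel name is made up; for a 1D launch the stride between a thread's consecutive scratch elements is just the total thread count, blockDim.x * gridDim.x):

__global__ void my_kernel_interleaved(float *global_array, unsigned int N)
{
    unsigned int tid          = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int totalThreads = blockDim.x * gridDim.x;

    // Scratch element i of this thread lives at global_array[i * totalThreads + tid],
    // so at each step the threads of a warp touch consecutive addresses (coalesced).
    for (unsigned int i = 0; i < N; ++i)
        global_array[i * totalThreads + tid] = 0.0f;
}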

Yeah, I was too busy thinking about whether & gave me the address to think of coalescing. But to get coalesced access, the harder indexing will be more than worthwhile.