Question about variables inside a kernel

Hello all!

I have created the following kernel:

__global__ void test(…) {

 unsigned char myVar[512];


In what kind of memory is myVar allocated: registers, global, shared, or something else?

Thanks in advance!

If every index into your array can be computed at compile time, it is highly likely that the array will be placed in registers. Otherwise it will be placed in local memory (i.e. a per-thread part of global memory).

In general the compiler seems to maximize register usage where possible (since registers are fast), but registers are not addressable, so an array indexed at run time cannot live in them.

Also, arrays of a size similar to yours may be allocated in local memory anyway, because keeping them in registers would require too many registers.
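To make the distinction concrete, here is a minimal sketch with two hypothetical kernels (`static_index` and `dynamic_index` are names invented for this example, as is the host helper `run_index_demo`). In the first, every index is known once the loops are unrolled, so the compiler is free to keep the array in registers. In the second, the index is read from memory at run time, which typically forces the array into local memory. What the compiler actually does depends on the toolkit version and target, so treat this as an illustration rather than a guarantee.

```cuda
#include <cuda_runtime.h>
#include <vector>

// All indices into `vals` are compile-time constants after unrolling,
// so the array is a candidate for registers (the compiler decides).
__global__ void static_index(const float *in, float *out)
{
    float vals[4];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        vals[i] = in[tid * 4 + i];
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        sum += vals[i];
    out[tid] = sum;
}

// `k` is only known at run time, so `vals[k]` needs an addressable
// array, which typically ends up in local memory.
__global__ void dynamic_index(const float *in, const int *pick, float *out)
{
    float vals[4];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < 4; ++i)
        vals[i] = in[tid * 4 + i];
    int k = pick[tid] & 3;   // keep the index inside the array
    out[tid] = vals[k];
}

// Hypothetical host helper: runs both kernels on 32 threads and checks
// the results against values computed on the host.
bool run_index_demo()
{
    const int n = 32;
    std::vector<float> h_in(n * 4), h_a(n), h_b(n);
    std::vector<int> h_pick(n);
    for (int i = 0; i < n * 4; ++i) h_in[i] = (float)i;
    for (int t = 0; t < n; ++t) h_pick[t] = t;

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    int *d_pick = nullptr;
    cudaMalloc(&d_in, sizeof(float) * n * 4);
    cudaMalloc(&d_a, sizeof(float) * n);
    cudaMalloc(&d_b, sizeof(float) * n);
    cudaMalloc(&d_pick, sizeof(int) * n);
    cudaMemcpy(d_in, h_in.data(), sizeof(float) * n * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(d_pick, h_pick.data(), sizeof(int) * n, cudaMemcpyHostToDevice);

    static_index<<<1, n>>>(d_in, d_a);
    dynamic_index<<<1, n>>>(d_in, d_pick, d_b);
    cudaMemcpy(h_a.data(), d_a, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b); cudaFree(d_pick);

    bool ok = true;
    for (int t = 0; t < n; ++t) {
        ok = ok && (h_a[t] == (float)(16 * t + 6));       // sum of 4 consecutive values
        ok = ok && (h_b[t] == (float)(4 * t + (t & 3)));  // the picked element
    }
    return ok;
}
```

Both kernels compute correct results either way; the difference only shows up in the resource usage the compiler reports for each of them.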

Variables without the __shared__ specifier are never allocated in shared memory.

The best way to answer your question is to compile the kernel and check the resulting .cubin file for its actual resource usage (registers, shared memory and local memory).
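As an alternative to reading the .cubin by hand, a reasonably recent toolkit can report the same numbers in two ways: compiling with `nvcc --ptxas-options=-v` prints register, local and shared usage per kernel, and the runtime call `cudaFuncGetAttributes` returns them programmatically. The sketch below uses the second route. The kernel `test_kernel` and the helper `report_usage` are hypothetical names, and the kernel body is only an assumption about what the original `test(…)` might do with its array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel modelled on the one in the question: a 512-byte
// per-thread array that is read with an index only known at run time.
__global__ void test_kernel(const int *idx, unsigned char *out)
{
    unsigned char myVar[512];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < 512; ++i)
        myVar[i] = (unsigned char)(i + tid);
    out[tid] = myVar[idx[tid] & 511];
}

// Query and print the compiled kernel's resource usage.
// A non-zero localSizeBytes means some per-thread data lives in local memory.
cudaError_t report_usage(cudaFuncAttributes *attr)
{
    cudaError_t err = cudaFuncGetAttributes(attr, test_kernel);
    if (err == cudaSuccess) {
        printf("registers per thread   : %d\n", attr->numRegs);
        printf("local bytes per thread : %zu\n", attr->localSizeBytes);
        printf("static shared bytes    : %zu\n", attr->sharedSizeBytes);
    }
    return err;
}
```

Calling `report_usage` once from host code is enough; the kernel does not need to be launched for the attributes to be available.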

I had the same doubt, so now I know. My related problem is that I need scratch space (an array of floats, for example) that is used only inside the kernel (no transfers between host and device). I could pick a maximum size and declare it as a local variable, but that would limit the input size of the program. If I allocated the memory dynamically outside the kernel, I would waste less space. On the other hand, I imagine that local memory is optimized somehow (i.e. in the previous example it doesn't use 512 * total threads, since the total number of threads never executes simultaneously, only the maximum allowed in one warp).
So, is there any way to allocate memory dynamically outside the kernel call and specify that it will be used as local memory?

Local memory is as slow as global memory, so you should avoid using it where possible. Check whether your kernel can benefit from using shared memory instead.
I'm not aware of any way to declare variable-sized arrays in a kernel. You can, however, dynamically allocate device memory from host code and then use it from the kernel.
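A minimal sketch of that approach, assuming each thread needs `len` floats of private scratch space: the host allocates one buffer of `total * len` floats with `cudaMalloc`, and each thread uses only its own slice. The names `use_scratch` and `run_with_scratch` are invented for this example, and the interleaved layout (element `j` of thread `t` at `scratch[j * total + t]`) is one possible choice so that neighbouring threads touch neighbouring addresses.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread owns `len` floats inside one large device buffer.
// Element j of thread t is stored at scratch[j * total + t].
__global__ void use_scratch(float *scratch, int len, float *out, int total)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total) return;
    for (int j = 0; j < len; ++j)
        scratch[j * total + tid] = (float)(tid + j);   // fill private scratch
    float sum = 0.0f;
    for (int j = 0; j < len; ++j)
        sum += scratch[j * total + tid];               // read it back
    out[tid] = sum;
}

// Hypothetical host helper: the scratch size is chosen at run time.
std::vector<float> run_with_scratch(int total, int len)
{
    float *d_scratch = nullptr, *d_out = nullptr;
    cudaMalloc(&d_scratch, sizeof(float) * (size_t)total * len);
    cudaMalloc(&d_out, sizeof(float) * total);

    int block = 128;
    int grid = (total + block - 1) / block;
    use_scratch<<<grid, block>>>(d_scratch, len, d_out, total);

    std::vector<float> h_out(total);
    cudaMemcpy(h_out.data(), d_out, sizeof(float) * total, cudaMemcpyDeviceToHost);
    cudaFree(d_scratch);
    cudaFree(d_out);
    return h_out;
}
```

The scratch buffer never has to be copied to or from the host; only the final `out` array is transferred here, and only so the result can be inspected.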

You can only size shared memory dynamically from outside your kernel, not local memory. Also, local memory is not optimized the way you think, because your kernel does not run to completion one warp at a time. All warps are ‘in flight’ on the multiprocessor, so each thread needs its own local memory.
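For reference, dynamically sized shared memory is declared with `extern __shared__` inside the kernel, and its size in bytes is passed as the third launch parameter. The sketch below is only an illustration: `reverse_block` and `run_reverse` are hypothetical names, and the kernel simply reverses each block's elements through the shared buffer.

```cuda
#include <cuda_runtime.h>
#include <vector>

// The size of `buf` is not fixed at compile time; it comes from the
// third parameter of the <<<...>>> launch configuration.
__global__ void reverse_block(const float *in, float *out)
{
    extern __shared__ float buf[];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[t] = in[base + t];
    __syncthreads();                       // wait until the whole block has written
    out[base + t] = buf[blockDim.x - 1 - t];
}

// Hypothetical host helper: fills the input with 0, 1, 2, ... and
// returns the per-block reversed result.
std::vector<float> run_reverse(int blocks, int threads)
{
    int n = blocks * threads;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(float) * n);
    cudaMalloc(&d_out, sizeof(float) * n);
    cudaMemcpy(d_in, h.data(), sizeof(float) * n, cudaMemcpyHostToDevice);

    size_t shmem = sizeof(float) * threads;   // shared bytes per block
    reverse_block<<<blocks, threads, shmem>>>(d_in, d_out);

    cudaMemcpy(h.data(), d_out, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h;
}
```

Note that the buffer is shared by all threads of a block, so the total size must fit within the per-block shared memory limit of the device.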

Ok, thanks for the information.