Where are variables declared inside a kernel stored?

Hi everyone,

If a variable is declared inside a kernel function, where is it stored: in registers, shared memory or global memory? Or is there some order of preference between them?

Take the following code as an example:

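[codebox]__global__ void function_kernel(…)
{
    __shared__ int val1[256];   // shared memory: one copy per block
    __shared__ int val2[256];

    int val3;                   // automatic (per-thread) variables:
    int val4;                   // where do these live?

    // ...
}
[/codebox]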


Where will the variables “val3” and “val4” be stored?

Thanks in advance.


My guess is a register or local memory.


The variables val3 and val4 don't take up much memory, so I think they will be stored in registers.
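A minimal sketch of how the storage classes in the question usually map onto hardware, assuming a current nvcc toolchain. The kernel `storage_demo` and its parameters are hypothetical. Scalar automatic variables like val3/val4 are normally assigned to registers; a per-thread array read with an index that is only known at run time usually cannot be kept in registers and is placed in local memory, which is private to the thread but physically lives in device memory. The final placement is the compiler's decision, and compiling with `nvcc -Xptxas -v` reports the registers, shared memory and local memory (stack frame / spills) each kernel uses.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel illustrating where each kind of variable usually ends up.
// The compiler makes the final decision; inspect it with: nvcc -Xptxas -v file.cu
// Assumes a 1-D block of at most 256 threads (the size of val1).
__global__ void storage_demo(const int* in, int* out, int k)
{
    __shared__ int val1[256];                      // shared memory: one copy per block
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int val3 = in[gid];                            // scalar automatic: normally a register
    int val4 = val3 * 2;                           // scalar automatic: normally a register

    // Per-thread array read with a runtime index (k): such arrays are typically
    // placed in local memory, although simple cases may still be optimized away.
    int scratch[16];
    for (int i = 0; i < 16; ++i)
        scratch[i] = val4 + i;

    val1[threadIdx.x] = scratch[k & 15];
    __syncthreads();                               // make every thread's write visible

    // Read a neighbour's entry: shared memory is visible to the whole block.
    out[gid] = val1[(threadIdx.x + 1) % blockDim.x];
}
```

Every thread gets its own val3, val4 and scratch, while val1 exists once per block and is shared by all of its threads.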

Thanks to you and the previous poster for your replies.

If there are 256 threads, each thread gets its own copy of these variables. That adds up to quite a large number.

I am curious about how the threads of one block are actually executed.

My guess is:

All threads of a warp execute the same instruction.

When “warp A” needs to access global memory, the multiprocessor switches to executing another warp, and the state of warp A stays in its registers.

When warp A's load or store has completed, the multiprocessor runs it again once another warp in turn accesses global memory.
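The grouping described in this guess can be made concrete. Threads of a block are split into warps of `warpSize` consecutive threads (32 on NVIDIA GPUs), and the hardware schedules whole warps. A minimal sketch with the hypothetical kernel `warp_ids`, assuming a 1-D block, records which warp and lane each thread belongs to. It only shows the grouping; the switching between warps on a memory stall is done by the hardware scheduler and is not visible from code.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: for a 1-D block, record each thread's warp index
// within the block and its lane (position inside the warp).
__global__ void warp_ids(int* warp_of, int* lane_of)
{
    int t   = threadIdx.x;                  // thread index within the block
    int gid = blockIdx.x * blockDim.x + t;  // global index for the output arrays
    warp_of[gid] = t / warpSize;            // threads 0..31 -> warp 0, 32..63 -> warp 1, ...
    lane_of[gid] = t % warpSize;            // position of the thread inside its warp
}
```

With 256 threads per block this gives 8 warps per block; the scheduler picks among those warps (and the warps of other resident blocks) whichever is ready to issue.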

Can anyone give me a clear explanation? Thanks.


Devices of compute capability 1.0/1.1 have 8192 registers per multiprocessor, while devices of compute capability 1.2/1.3 have 16384 registers per multiprocessor. This should be enough in most cases, even for fairly complicated kernels.
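Rather than looking the register file size up by compute capability, it can be queried at run time. A minimal sketch, where the helper name `registers_per_sm` is hypothetical; `cudaGetDeviceProperties` and the `cudaDeviceProp` fields used are part of the CUDA runtime API, although `regsPerMultiprocessor` was only added in later toolkits (older ones exposed just `regsPerBlock`, which on compute capability 1.x equals the per-multiprocessor count).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print and return the number of 32-bit registers per
// multiprocessor of the given device. Returns -1 if the query fails.
int registers_per_sm(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    std::printf("%s (compute capability %d.%d): %d registers per SM, "
                "%d registers per block\n",
                prop.name, prop.major, prop.minor,
                prop.regsPerMultiprocessor, prop.regsPerBlock);
    return prop.regsPerMultiprocessor;
}
```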

IMHO the compiler determines at compile time how many registers each thread needs to execute the kernel from start to finish. So if you need 20 registers per thread and run a block of 256 threads, the hardware allocates 256 × 20 = 5120 registers for that block.
If there are enough registers (and other resources such as shared memory, active threads, etc.) left, it also allocates the same amount of storage for other blocks, which can then be executed concurrently by interleaving their warps.
In other words, all the resources a block needs are allocated when the block starts, and each thread has access to its own set of 20 registers in this case. When a warp stalls, the multiprocessor can switch to another warp, which operates on its own set of registers, so no warp state has to be saved or restored.
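The arithmetic in this reply can be checked with a few lines. A minimal sketch of the simplified model above (the function names are hypothetical): a block of 256 threads needing 20 registers each reserves 5120 registers, so only one such block fits in 8192 registers, while three fit in 16384 (3 × 5120 = 15360). Real hardware also rounds register allocation up to a granularity and is limited by shared memory, warps and blocks per multiprocessor, so this is an upper bound rather than an exact figure.

```cuda
// Simplified model: a block is resident only if all of its registers can be
// allocated at once. Ignores allocation granularity and other resource limits.
constexpr int registers_per_block(int regs_per_thread, int threads_per_block)
{
    return regs_per_thread * threads_per_block;
}

constexpr int max_blocks_by_registers(int regs_per_sm, int regs_per_thread,
                                      int threads_per_block)
{
    // Integer division: a partially fitting block cannot be launched.
    return regs_per_sm / registers_per_block(regs_per_thread, threads_per_block);
}

static_assert(registers_per_block(20, 256) == 5120, "256 threads x 20 registers");
static_assert(max_blocks_by_registers(8192, 20, 256) == 1, "compute capability 1.0/1.1");
static_assert(max_blocks_by_registers(16384, 20, 256) == 3, "compute capability 1.2/1.3");
```

On current toolkits the runtime can compute the real limit, including granularity and the other resources, with `cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, kernel, blockSize, dynamicSmemBytes)`.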


Thanks, Nico.