Where are variables stored in device memory? Reducing the device register count

I have a kernel for a filter.

I have some questions about how its data is stored in memory.

I’m asking because memory access has been a major
time-consuming part of my code, and the cost is more than I can afford…

I cannot afford global memory latency in the kernel code.

Q.1) Where are local variables (declared in kernel code) stored?
In registers or in global memory?

Q.2) Where are local arrays (declared in kernel code) stored?
In registers or in global memory?

Q.3) From the .cubin file I know that I’m using a high number of
registers, and I want to reduce that number to get maximum
occupancy of the GPU (according to the CUDA occupancy calculator Excel sheet).
I tried reducing (i) the local variables and (ii) the parameters
passed in the kernel function call.

Neither of them worked. How can I reduce the number of registers used?

I couldn’t conclude much from the generated .cubin file or from NVCC_1.0.pdf.

Can anyone give me any pointers that would help me find
a solution?

Thanks in advance…

1, 2: NVCC tries to place local variables in registers. Small arrays are also placed in registers if you access them only with fixed (compile-time constant) indexes. If you use variable indexing, the array is placed in local memory, which is as slow as global memory.
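To make the difference concrete, here is a minimal sketch, not taken from the original poster's filter. The kernel names, the `taps` array and the `sel` index buffer are made up for illustration, and it assumes a toolkit and GPU that support `cudaMallocManaged`. Where the compiler actually places each array can vary by toolkit version and target, so it is worth checking the reported local memory usage (for example by compiling with `nvcc --ptxas-options=-v`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every access uses a compile-time constant index, so the compiler is free
// to keep the four elements of taps in registers.
__global__ void fixedIndex(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float taps[4];
    taps[0] = in[i] * 1.0f;
    taps[1] = in[i] * 2.0f;
    taps[2] = in[i] * 3.0f;
    taps[3] = in[i] * 4.0f;
    out[i] = taps[0] + taps[1] + taps[2] + taps[3];
}

// The index comes from sel[i], which is only known at run time. Registers
// cannot be addressed dynamically, so taps is likely to end up in local memory.
__global__ void variableIndex(const float *in, const int *sel, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float taps[4];
    taps[0] = in[i] * 1.0f;
    taps[1] = in[i] * 2.0f;
    taps[2] = in[i] * 3.0f;
    taps[3] = in[i] * 4.0f;
    out[i] = taps[sel[i] & 3];   // & 3 keeps the index inside 0..3
}

int main()
{
    const int n = 256;
    float *in, *outFixed, *outVar;
    int *sel;

    // Managed memory keeps the host code short (an assumption of this sketch).
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&outFixed, n * sizeof(float));
    cudaMallocManaged(&outVar, n * sizeof(float));
    cudaMallocManaged(&sel, n * sizeof(int));

    for (int i = 0; i < n; ++i) {
        in[i] = 1.0f;
        sel[i] = i;
    }

    fixedIndex<<<(n + 127) / 128, 128>>>(in, outFixed, n);
    variableIndex<<<(n + 127) / 128, 128>>>(in, sel, outVar, n);
    cudaDeviceSynchronize();

    printf("fixed: %f  variable: %f\n", outFixed[0], outVar[5]);

    cudaFree(in);
    cudaFree(outFixed);
    cudaFree(outVar);
    cudaFree(sel);
    return 0;
}
```

If the second kernel reports a non-zero local memory size while the first reports none, that is the effect described above.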

3: It is not trivial to reduce the number of registers by altering the source code, because NVCC performs really aggressive optimization. However, you can use the -maxrregcount switch with nvcc to limit the maximum number of registers. This may cause some variables to be placed in local memory and make your kernel slower. Loop unrolling is also known to increase register usage, so if you have loops you may wish to keep them from being unrolled. There is also a thread on this forum with some tips for reducing the number of registers used.
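As a rough sketch of both suggestions, the kernel below keeps its loop rolled with `#pragma unroll 1` (available in toolkits that support the unroll pragma), and the comment at the top shows one way to pass `-maxrregcount`. The kernel name `boxFilter`, the file name `filter.cu`, the `radius` parameter and the limit of 32 registers are illustrative assumptions, not values from the original post. It also assumes `cudaMallocManaged` is available.

```cuda
// Possible build line (the register limit of 32 is only an example):
//   nvcc -maxrregcount=32 --ptxas-options=-v filter.cu -o filter
// --ptxas-options=-v prints the register and local memory usage per kernel,
// which shows whether the limit forced spills to local memory.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void boxFilter(const float *in, float *out, int n, int radius)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;

    // Ask the compiler not to unroll this loop; unrolled iterations tend to
    // keep more temporaries live and so need more registers.
    #pragma unroll 1
    for (int k = -radius; k <= radius; ++k) {
        int j = min(max(i + k, 0), n - 1);   // clamp at the borders
        acc += in[j];
    }
    out[i] = acc / (2 * radius + 1);
}

int main()
{
    const int n = 256;
    const int radius = 3;
    float *in, *out;

    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    boxFilter<<<(n + 127) / 128, 128>>>(in, out, n, radius);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Comparing the reported register count with and without the pragma, and with different `-maxrregcount` values, is a reasonable way to see which setting actually helps occupancy without pushing too much into local memory.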