I have a kernel that implements a filter.
I have some queries about where its data is stored in memory.
I’m asking because memory access is the most time-consuming
part of my code, and that cost has become unaffordable for me…
I can’t afford global memory latency in the kernel code.
============================================
Q.1) Where are local variables (declared in kernel code) stored?
In registers or in global memory?
Q.2) Where are local arrays (declared in kernel code) stored?
In registers or in global memory?
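To make Q.1 and Q.2 concrete, here is a minimal sketch of the three cases being asked about. The kernel `filterSketch`, the constant `TAPS`, the parameter `pick` and the helper `runSelfCheck` are all hypothetical names and not taken from my real code. The placement comments describe what the compiler usually does, not a guarantee: scalars and arrays indexed only by compile-time constants can normally live in registers, while an array indexed by a run-time value is typically placed in per-thread local memory, which resides in device memory.

```cuda
// Compile with: nvcc -Xptxas -v sketch.cu  (reports registers and local memory per kernel)
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

#define TAPS 4  // hypothetical filter length

// Hypothetical filter kernel, for illustration only (not the kernel from this post).
__global__ void filterSketch(const float *in, const float *coeff,
                             float *out, int n, int pick)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // scalar local: normally a register
    if (i + TAPS > n) return;

    float c[TAPS];  // indexed only by unrolled constants: can usually stay in registers
    float w[TAPS];  // also indexed by the run-time value `pick`: typically local memory
    #pragma unroll
    for (int k = 0; k < TAPS; ++k) { c[k] = coeff[k]; w[k] = in[i + k]; }

    float acc = 0.0f;  // scalar local: normally a register
    #pragma unroll
    for (int k = 0; k < TAPS; ++k) acc += c[k] * w[k];

    out[i] = acc + w[pick];  // dynamic index; assumes 0 <= pick < TAPS
}

// Runs the kernel on a tiny input and returns the number of mismatches
// against a CPU reference.
int runSelfCheck()
{
    const int n = 16, m = n - TAPS + 1, pick = 2;
    float hIn[n], hOut[m];
    float hCoeff[TAPS] = {0.25f, 0.25f, 0.25f, 0.25f};
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    float *dIn, *dCoeff, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dCoeff, TAPS * sizeof(float));
    cudaMalloc(&dOut, m * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dCoeff, hCoeff, TAPS * sizeof(float), cudaMemcpyHostToDevice);

    filterSketch<<<1, 32>>>(dIn, dCoeff, dOut, n, pick);
    cudaMemcpy(hOut, dOut, m * sizeof(float), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < m; ++i) {
        float ref = 0.0f;
        for (int k = 0; k < TAPS; ++k) ref += hCoeff[k] * hIn[i + k];
        ref += hIn[i + pick];
        if (fabsf(ref - hOut[i]) > 1e-5f) ++bad;
    }
    cudaFree(dIn); cudaFree(dCoeff); cudaFree(dOut);
    return bad;
}

int main()
{
    printf("%s\n", runSelfCheck() == 0 ? "ok" : "mismatch");
    return 0;
}
```

If the `-Xptxas -v` output shows a non-zero local memory figure for the kernel, that would suggest at least one of the arrays (or a spilled register) ended up outside the register file.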
Q.3) From the .cubin file I know that my kernel uses a large number
of registers, and I want to reduce that number to get maximum
occupancy on the GPU (according to the CUDA Occupancy Calculator
Excel sheet).
I tried (i) reducing the number of local variables and (ii) reducing
the number of parameters passed in the kernel function call.
Neither worked. How can I reduce the number of registers used?
I couldn’t conclude much from the generated .cubin file or from
NVCC_1.0.pdf.
Can anyone give me any pointers that I could use to find
a solution?
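For reference, the kind of compiler-level knobs Q.3 is about can be sketched as follows. This is only a sketch under assumptions: it shows the nvcc `-maxrregcount` option and the `__launch_bounds__` qualifier, and whether either is available depends on the toolkit version in use. The kernel `scaleCapped`, the helper `runCappedCheck` and the values 32 and 256 are hypothetical examples. A cap that is too low may force the compiler to spill registers to local memory, which would bring back the memory latency this post is trying to avoid.

```cuda
// Per-file alternative: nvcc -maxrregcount=32 capped.cu  (32 is an arbitrary example value)
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel. __launch_bounds__(256) tells the compiler that the kernel
// is never launched with more than 256 threads per block, which lets it budget
// registers per thread accordingly.
__global__ void __launch_bounds__(256) scaleCapped(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

// Launches the kernel within its declared bound and returns the number of wrong results.
int runCappedCheck()
{
    const int n = 512;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scaleCapped<<<2, 256>>>(d, 2.0f, n);  // 256 threads per block matches the bound
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int bad = 0;
    for (int i = 0; i < n; ++i) if (h[i] != 2.0f) ++bad;
    return bad;
}

int main()
{
    printf("%s\n", runCappedCheck() == 0 ? "ok" : "mismatch");
    return 0;
}
```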
Thanks in advance…