I have CUDA code for a GTX 280 in which the shared memory resource is underutilized. Occupancy is limited by the register count, which is capped at 20 per thread and causes some local memory (lmem) spills as well. For the given block size there is still room to use at least 3-4 times more shared memory per thread at the same occupancy. Hence I tried to offload some of the register pressure onto shared memory.
I turned two of the kernel's local variables into two shared memory arrays of block size. The kernel now uses a combination of dynamically allocated shared memory (these two arrays) and statically allocated shared memory (some more shared variables declared locally in the kernel). Occupancy is unchanged and still register-limited.
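To make the setup concrete, here is a minimal sketch (with hypothetical names and placeholder arithmetic, not my actual kernel) of the layout I am describing: two former per-thread local variables moved into a dynamically allocated shared array, alongside a statically declared shared buffer.

```cuda
// Sketch only: names and math are placeholders for illustration.
__global__ void myKernel(float *out, const float *in, int n)
{
    // Dynamically allocated at launch: 2 * blockDim.x floats,
    // one slot per thread for each of the two offloaded variables.
    extern __shared__ float dynSmem[];
    float *varA = &dynSmem[threadIdx.x];               // was a local variable
    float *varB = &dynSmem[blockDim.x + threadIdx.x];  // was a local variable

    // Statically allocated shared memory, as before
    // (assumes blockDim.x <= 256 so each thread has its own slot).
    __shared__ float staticBuf[256];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        *varA = in[tid];
        *varB = *varA * 2.0f;               // placeholder arithmetic
        staticBuf[threadIdx.x] = *varB;
        out[tid] = *varB + staticBuf[threadIdx.x];
    }
}

// Launch: the third configuration argument supplies the size of the
// dynamic shared memory region.
// myKernel<<<grid, block, 2 * block.x * sizeof(float)>>>(d_out, d_in, n);
```

With per-thread slots laid out contiguously by threadIdx.x, the accesses are bank-conflict-free on the GTX 280 (16 banks, 4-byte words per half-warp).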
I see a drastic reduction in lmem accesses but absolutely no performance gain, which is baffling :o Profiling shows that the only counters that increased are global memory accesses (both loads and stores); an increase in lmem accesses would have been understandable, but not this. Declaring the shared memory variables volatile did not help either.
What is the connection between global memory accesses and shared memory? What could be the reason for no performance gain? My code has no data reuse that could benefit from shared memory, so offloading some of the register pressure seems to be the only way to harness the unused shared memory resource.
I would be thankful if someone could suggest a solution or share some insight.
Thanks & regards,