Shared and Global Memory connections

Hi,

I have CUDA code for a GTX 280 in which the shared memory resource is underutilized. Occupancy is limited by the register count, which is clipped at 20 and therefore causes some lmem accesses as well. For the given block size there is still scope to use at least 3-4 times more shared memory per thread at the same occupancy, so I tried to offload some of the register load onto shared memory.

I turned two of the kernel's local variables into two block-sized shared memory arrays. The kernel now has a combination of dynamically allocated shared memory (these two arrays) and statically allocated shared memory (some more shared variables declared locally in the kernel). It still has the same occupancy and is still limited by registers.
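
Roughly, the structure now looks like this (a minimal sketch rather than my actual kernel; BLOCK_SIZE, the names and the arithmetic are just placeholders):

    #define BLOCK_SIZE 128

    __global__ void kernel(const float *in, float *out, int n)
    {
        // Dynamically allocated shared memory, sized at launch time; the two
        // block-sized arrays that replaced local variables are carved out of it.
        extern __shared__ float dyn_smem[];
        float *tmp_a = dyn_smem;
        float *tmp_b = dyn_smem + blockDim.x;

        // Statically allocated shared memory declared inside the kernel.
        __shared__ float stat_smem[BLOCK_SIZE];

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        // Values that previously lived in registers (or spilled to lmem) now
        // sit in each thread's own shared-memory slot.
        tmp_a[threadIdx.x] = in[tid] * 2.0f;
        tmp_b[threadIdx.x] = in[tid] + 1.0f;
        stat_smem[threadIdx.x] = tmp_a[threadIdx.x] - tmp_b[threadIdx.x];

        out[tid] = tmp_a[threadIdx.x] + stat_smem[threadIdx.x];
    }

    // Launch: the third <<< >>> parameter is the dynamic shared-memory size in bytes.
    // kernel<<<num_blocks, BLOCK_SIZE, 2 * BLOCK_SIZE * sizeof(float)>>>(d_in, d_out, n);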

I see a drastic reduction in lmem accesses but absolutely no performance gain, which is baffling :o Upon profiling, the only thing that has increased is global memory access (both loads and stores); it would have been understandable if it were lmem accesses that increased. Using volatile shared memory variables did not help either.

What is the connection between global memory accesses and shared memory? What could be the reason for no performance gain? My code has no data reuse that would benefit from shared memory, so offloading some of the register load seems to be the only way to harness the shared memory resource.

I would be thankful if someone could suggest a solution or share some insight.

Thanks & regards,

Aditi

First things first. I personally wouldn’t be worrying too much about occupancy on a GTX 280 until it gets really quite low. Best Practices Guide Section 4.3:

“To hide arithmetic latency completely, multiprocessors should be running at least 192 threads (6 warps). This equates to 25 percent occupancy on devices with compute capability 1.1 and lower, and 18.75 percent occupancy on devices with compute capability 1.2 and higher.”

Constraining your register count brings you up to 0.8 occupancy, whereas I personally wouldn’t think about it until it was below 0.5 (and probably not much until it got below 0.25). What is the default register count of your kernel? Have you verified that increasing occupancy by an amount that doesn’t dip you into lmem actually gives you better performance?
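
For reference, a sketch of how the register cap and the per-kernel resource report are usually driven from the build side (the file name, kernel and the 256 figure are placeholders, and __launch_bounds__ is only available in newer toolkits than yours):

    // Build-time cap for every kernel in the file, plus a resource report
    // (registers, smem, lmem per kernel) from ptxas:
    //   nvcc -arch=sm_13 --maxrregcount=20 --ptxas-options=-v kernel.cu
    //
    // Newer toolkits also accept a per-kernel hint instead of the global flag;
    // 256 threads per block below is purely illustrative:
    __global__ void __launch_bounds__(256) my_kernel(float *out)
    {
        out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
    }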

Secondly - are you sure you didn’t increase global memory usage by breaking coalescing or something? The change you’ve described shouldn’t affect global memory, only lmem, smem and registers.
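
For example (illustrative only, not your code): on compute 1.2/1.3 coalescing is per half-warp, so if a change pushes the threads of a half-warp into different memory segments the hardware issues more global transactions even though lmem, smem and register usage look unchanged:

    __global__ void demo(const float *in, float *out, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Coalesced: the 16 threads of a half-warp read 16 consecutive
        // 32-bit words from one aligned segment -> few transactions.
        float a = in[tid];

        // Uncoalesced (stride > 1): the same half-warp now touches several
        // segments -> correspondingly more global transactions.
        float b = in[tid * stride];

        out[tid] = a + b;
    }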

Thanks for your reply. I am not worried about occupancy as such; I am using it as a check that I do not push the block size too hard against the resources and mess something up. The whole idea behind offloading the register burden onto shared memory (since it was available) was to reduce lmem accesses and improve performance. In the past I have seen improvements from reducing lmem accesses!!

Now answers to your queries:

  1. The default register count is 24 to 26 in the two kernels in my code. Clipping it to 20 leads to 80+16 bytes of lmem, which I would guess should benefit performance if reduced.

  2. I don’t believe I broke coalescing by doing this, for several reasons. The transfer is only between registers and shared memory; global memory doesn’t come into the picture anywhere. Every thread accesses successive 32-bit memory locations, so both coalescing and bank conflicts should be taken care of on the GTX 280.

  3. No, I no longer see any performance benefit from increasing occupancy, for the reasons you mentioned. My kernels run at 50 to 75% occupancy depending on the input size, and I think that is the most occupancy can contribute (even in code like mine, which suffers mainly from large amounts of memory access).

  4. I have benefited from reduced lmem accesses before, so I am reluctant to believe that the scheduler is efficient enough to hide their cost.

  5. I am using the CUDA 2.0 profiler and have no way to detect uncoalesced memory accesses. In any case, it is an increase in coalesced memory accesses that I see.

Please suggest anything else you can think of.

Thanks & regards,

Aditi