Shared and Global Memory connections

Hi,

I have CUDA code for a GTX 280 in which the shared memory resource is underutilized. Occupancy is limited by the register count, which is clipped at 20 and therefore causes some lmem accesses as well. For the given block size there is still scope to use at least 3-4 times more shared memory per thread at the same occupancy, so I tried to offload some of the register load onto shared memory.

I turned two of the kernel's local variables into two block-sized shared memory arrays. The kernel now has a combination of dynamically allocated shared memory (these two arrays) and statically allocated shared memory (some more shared variables declared locally in the kernel). It still has the same occupancy and is still limited by registers.
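
Roughly, the structure now looks like this (a minimal sketch rather than my actual kernel; BLOCK_SIZE, the names and the arithmetic are just placeholders):

    #define BLOCK_SIZE 128

    __global__ void kernel(const float *in, float *out, int n)
    {
        // Dynamically allocated shared memory, sized at launch time; the two
        // block-sized arrays that replaced local variables are carved out of it.
        extern __shared__ float dyn_smem[];
        float *tmp_a = dyn_smem;
        float *tmp_b = dyn_smem + blockDim.x;

        // Statically allocated shared memory declared inside the kernel.
        __shared__ float stat_smem[BLOCK_SIZE];

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        // Values that previously lived in registers (or spilled to lmem) now
        // sit in each thread's own shared-memory slot.
        tmp_a[threadIdx.x] = in[tid] * 2.0f;
        tmp_b[threadIdx.x] = in[tid] + 1.0f;
        stat_smem[threadIdx.x] = tmp_a[threadIdx.x] - tmp_b[threadIdx.x];

        out[tid] = tmp_a[threadIdx.x] + stat_smem[threadIdx.x];
    }

    // Launch: the third <<< >>> parameter is the dynamic shared-memory size in bytes.
    // kernel<<<num_blocks, BLOCK_SIZE, 2 * BLOCK_SIZE * sizeof(float)>>>(d_in, d_out, n);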

I see a drastic reduction in lmem accesses but absolutely no performance gain, which is baffling :o Upon profiling, the only thing that has increased is global memory access (both loads and stores); it would have been understandable if it were lmem accesses that increased. Using volatile shared memory variables did not help either.

What is the connection between global memory accesses and shared memory? What could be the reason for no performance gain? My code has no data reuse that would benefit from shared memory, so offloading some of the register load seems to be the only way to harness the shared memory resource.

I would be thankful if someone could suggest a solution or share some insight.

Thanks & regards,

Aditi

First things first. I personally wouldn’t be worrying too much about occupancy on a GTX 280 until it gets really quite low. Best Practices Guide Section 4.3:

“To hide arithmetic latency completely, multiprocessors should be running at least 192 threads (6 warps). This equates to 25 percent occupancy on devices with compute capability 1.1 and lower, and 18.75 percent occupancy on devices with compute capability 1.2 and higher.”

Constraining your register count brings you up to 0.8 occupancy, whereas I personally wouldn’t think about it until it was below 0.5 (and probably not much until it got below 0.25). What is the default register count of your kernel? Have you verified that increasing occupancy by an amount that doesn’t dip you into lmem actually gives you better performance?
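
For reference, a sketch of how the register cap and the per-kernel resource report are usually driven from the build side (the file name, kernel and the 256 figure are placeholders, and __launch_bounds__ is only available in newer toolkits than yours):

    // Build-time cap for every kernel in the file, plus a resource report
    // (registers, smem, lmem per kernel) from ptxas:
    //   nvcc -arch=sm_13 --maxrregcount=20 --ptxas-options=-v kernel.cu
    //
    // Newer toolkits also accept a per-kernel hint instead of the global flag;
    // 256 threads per block below is purely illustrative:
    __global__ void __launch_bounds__(256) my_kernel(float *out)
    {
        out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
    }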

Secondly - are you sure you didn’t increase global memory usage by breaking coalescing or something? The change you’ve described shouldn’t affect global memory, only lmem, smem and registers.
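
For example (illustrative only, not your code): on compute 1.2/1.3 coalescing is per half-warp, so if a change pushes the threads of a half-warp into different memory segments the hardware issues more global transactions even though lmem, smem and register usage look unchanged:

    __global__ void demo(const float *in, float *out, int stride)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Coalesced: the 16 threads of a half-warp read 16 consecutive
        // 32-bit words from one aligned segment -> few transactions.
        float a = in[tid];

        // Uncoalesced (stride > 1): the same half-warp now touches several
        // segments -> correspondingly more global transactions.
        float b = in[tid * stride];

        out[tid] = a + b;
    }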

Thanks for your reply. I am not worried about occupancy as such; I am using it as a check that I do not push the block size too hard against the resources and mess something up. The whole idea behind offloading the register burden onto shared memory (since it was available) was to reduce lmem accesses and improve performance. In the past I have seen improvements from reducing lmem accesses!!

Now answers to your queries:

  1. The default register count is 24 to 26 in the two kernels in my code. Clipping it to 20 leads to 80+16 bytes of lmem, which I would guess should benefit performance if reduced.

  2. I don’t believe I broke coalescing by doing this, for several reasons. The transfer is only between registers and shared memory; global memory doesn’t come into the picture anywhere. Every thread accesses successive 32-bit memory locations, so both coalescing and bank conflicts should be taken care of on the GTX 280.

  3. No, I no longer see any performance benefit from increasing occupancy, for the reasons you mentioned. My kernels run at 50 to 75% occupancy depending on the input size, and I think that is the most occupancy can contribute (even in code like mine, which suffers mainly from large amounts of memory access).

  4. I have benefited from reduced lmem accesses before, so I am reluctant to believe that the scheduler is efficient enough to hide their cost.

  5. I am using the CUDA 2.0 profiler and have no way to detect uncoalesced memory accesses. In any case, it is an increase in coalesced memory accesses that I see.

Please suggest anything else you can think of.

Thanks & regards,

Aditi