Registers and Shared Memory

Hi there,

I have a kernel that reads data from global memory, does some calculations and copies the data back to global memory. I wrote two versions of this kernel (both versions have 256 threads per block):
First version: every thread uses 8 integer variables and 7 float variables (all of these variables are mapped to registers). In this case I'm using roughly 16K registers per block and no shared memory.
Second version: every thread uses 8 integer variables, and the float variables are mapped to shared memory. In this case I am using 8K registers per block and roughly 8K of shared memory per block.
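Roughly, the second version declares its temporaries like this (a simplified sketch; the real names and the calculation are different):

    __global__ void kernel_v2(const float *in, float *out)
    {
        // 7 float temporaries mapped to shared memory: one array per variable,
        // each indexed by threadIdx.x (256 threads per block, ~7 KB in total)
        __shared__ float f0[256], f1[256], f2[256];   // ... up to f6[256] in the real kernel

        int idx = blockIdx.x * blockDim.x + threadIdx.x;   // integer temporaries stay in registers

        f0[threadIdx.x] = in[idx];                      // read from global memory
        f1[threadIdx.x] = f0[threadIdx.x] * 2.0f;       // (placeholder calculation)
        f2[threadIdx.x] = f1[threadIdx.x] + f0[threadIdx.x];
        out[idx] = f2[threadIdx.x];                     // write back to global memory
    }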

My GPU is a Tesla T10, with 16K registers and 16 KB of shared memory per SM.

When I time both versions, the first version is faster than the second one. I understand that the access time to shared memory can be slightly higher than the access time to registers; however, in the second version I can have 2 blocks assigned to an SM, whereas in the first version only one block is assigned to the SM.

My question is: since I can fit 2 blocks on an SM, the second version should be faster, yet the timings show this is not true. What is the explanation?

Thanks,

See what happens in the Occupancy calculator:

1st case: 256 threads, 15 registers, 0 smem - 100%

2nd case: 256 threads, 8 registers, 8000 smem - 50%

Probably explains what you see :)
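(Rough arithmetic behind those numbers, assuming the T10 limits of 16384 registers, 16 KB of shared memory and 1024 threads per SM:
1st case: 15 regs x 256 threads = 3840 registers per block, so 4 blocks fit on an SM -> 1024 threads -> 100%.
2nd case: 8000 bytes of smem per block, so only 2 blocks fit into 16 KB -> 512 threads -> 50%.)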

eyal

First of all, you can check occupancy using the profiler, but beyond a certain point (~192 threads) you are not always going to see much benefit from higher occupancy (192 threads are enough to hide the register read-after-write latency); it depends on how much latency there is left to hide.

It depends on your access patterns, but you may be seeing bank conflicts. Try putting the data in columns in shared memory instead of rows (i.e. store the first variable for all threads consecutively, then the second, and so on, which works out to a stride of 256 between a given thread's variables). This makes sure that all the data for each thread sits in a single bank and should remove the bank conflicts.
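Something along these lines (a rough sketch; the array name and the placeholder calculation are invented, 256 threads per block and 7 float temporaries assumed as in your description):

    #define NTHREADS 256
    #define NVARS    7

    __global__ void kernel_columns(const float *in, float *out)
    {
        // "Column" layout: variable index major, thread index minor.
        // Variable v of the current thread lives at tmp[v * NTHREADS + threadIdx.x].
        // Consecutive threads touch consecutive 32-bit words, i.e. consecutive banks,
        // so a half-warp never hits the same bank twice; and since the stride between
        // a single thread's variables (256) is a multiple of 16, they all share one bank.
        __shared__ float tmp[NVARS * NTHREADS];

        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        tmp[0 * NTHREADS + threadIdx.x] = in[idx];
        tmp[1 * NTHREADS + threadIdx.x] = tmp[0 * NTHREADS + threadIdx.x] * 0.5f;
        out[idx] = tmp[1 * NTHREADS + threadIdx.x];
    }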

Bank conflicts in shared memory are probably the last thing to check with regard to performance issues.

In the first case you are only using 15*256=3840 registers. There are 16K word-wide registers equating to 64KB of register memory, but only 16KB of shared memory.

Well then, I guess we do different things, but it is usually the most significant performance issue I see in my code, apart from global memory access patterns.

Thanks Tera, so each register is 4 bytes wide? If so, an integer or a float fits in one register, right?

Thanks Eyal, yes, it explains what I see. I was also confused into thinking that I needed 4 registers to store a float/int value, which is wrong; that's why I said I was using roughly 16K registers (4 bytes x 16 regs x 256 threads), when I was really using only 4K registers (16 regs x 256 threads).

Thanks. In my case the data in shared memory doesn't have any particular order, because it only stores temporary values. I only mapped those variables to shared memory to use fewer registers.

On a somewhat related note, in general how much faster is access to registers compared to shared memory for Fermis and non-Fermis?

Yes.

Which likely means there will be no bank conflicts: if you turn 4-byte automatic variables into arrays in shared memory indexed by the thread ID, access is free of bank conflicts.

I have no hard data, but based on anecdotal evidence consistent with my own measurements, shared memory adds 2 clock cycles if one operand comes from shared memory (6 instead of 4 cycles for most operations). If 2 operands come from shared memory, one is moved through a register, which adds another 6 cycles (12 in total). Writeback to shared memory appears to be free throughput-wise (latency-wise I guess it increases the latency from 22 to 24 cycles, which would be hard to measure).

All this is pre-Fermi, and mostly speculative. No liability assumed. :) I’ll try to dig up a reference.


I am trying to measure my kernel at the cycle level. How are you measuring your code? Could you give me some ideas, because I need to do the same?

Thanks

Actually, no

From what he says, all threads access the same element index within their own set of variables. The fact that he talked about 7 elements in shared memory, which is a prime number, suggests no bank conflicts. If it were 8 32-bit variables per thread stored consecutively in shared memory, and all threads accessed the same element (say, all threads access the first element in their own array), you would actually get 8-way bank conflicts.

What I would actually be interested in is how much of a difference the OP sees, and, looking at the Visual Profiler, whether there is any warp serialization taking place.

Also how many runs were taken, as run times are not consistent and the average is the interesting number.

Every variable mapped to a register has its own instance in every thread. As I said, I only mapped those variables to shared memory to use fewer registers. Therefore every thread accesses different elements in the array.

And that is where the problem lies with bank conflicts. Shared memory is split into 16 banks, so if two word addresses give the same number modulo 16, they sit in the same bank.

So if thread 1 accesses word 0 and thread 2 accesses word 16, you get a two-way bank conflict.

If you had 8 consecutive 32-bit values per thread in shared memory, and each thread accessed element 0 of its own array, what you would have gotten is

thread:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
element:  0   8  16  24  32  40  48  56  64  72  80  88  96 104 112 120
bank:     0   8   0   8   0   8   0   8   0   8   0   8   0   8   0   8

which produces 8-way bank conflicts (and in my experience this can mean a 2x or 3x performance difference, depending on arithmetic intensity and occupancy, and thus the ability to hide latency)

What saves you in this case is that you have 7 elements, which is a prime number, so the least common multiple with 16 is 16*7 and there are no bank conflicts (any prime apart from 2 assures no bank conflicts; any even number of elements assures that there are bank conflicts)
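(To spell it out: for 16 consecutive threads with a stride of 7 words, the banks hit are (7*t) mod 16 for t = 0..15, i.e. 0, 7, 14, 5, 12, 3, 10, 1, 8, 15, 6, 13, 4, 11, 2, 9 - every bank exactly once.)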

The only exception would be if all threads were reading the same value (not just the same bank, but the exact same address), in which case instead of a 16-way bank conflict you would get a broadcast.

You're thinking too complicated: your calculation would apply if the 8 variables per thread were put into a struct, and then an array of these structs indexed by the thread ID were created.

In the straightforward way, where each variable is separately turned into an array, no bank conflicts occur.
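In code, the two layouts would look roughly like this (a sketch with invented names, assuming 8 floats per thread and 256 threads per block):

    struct PerThread { float v[8]; };

    __global__ void layouts(float *out)
    {
        // Array of structs: each thread's 8 variables occupy 8 consecutive words,
        // so within a half-warp the accesses to v[0] are 8 words apart and only
        // ever touch banks 0 and 8 -> 8-way bank conflicts.
        __shared__ PerThread aos[256];
        aos[threadIdx.x].v[0] = (float)threadIdx.x;

        // One array per variable: consecutive threads read consecutive words,
        // i.e. 16 different banks per half-warp -> no bank conflicts.
        __shared__ float v0[256];
        v0[threadIdx.x] = aos[threadIdx.x].v[0];

        out[blockIdx.x * blockDim.x + threadIdx.x] = v0[threadIdx.x];
    }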

In fact each variable is turned into an array. Thanks tera and laughingrice for adding more ideas.