Registers and Shared Memory

Hi there,

I have a kernel that reads data from global memory, does some calculations and copies the data back to global memory. I wrote two versions of this kernel (both versions have 256 threads per block):
First version: every thread uses 8 integer variables and 7 float variables (all of these variables are mapped to registers). In this case I'm using roughly 16K registers per block and no shared memory.
Second version: every thread uses 8 integer variables, and the float variables are mapped to shared memory. In this case I am using 8K registers per block and roughly 8K of shared memory per block.
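Roughly, the second version declares its temporaries like this (a simplified sketch; the real names and the calculation are different):

    __global__ void kernel_v2(const float *in, float *out)
    {
        // 7 float temporaries mapped to shared memory: one array per variable,
        // each indexed by threadIdx.x (256 threads per block, ~7 KB in total)
        __shared__ float f0[256], f1[256], f2[256];   // ... up to f6[256] in the real kernel

        int idx = blockIdx.x * blockDim.x + threadIdx.x;   // integer temporaries stay in registers

        f0[threadIdx.x] = in[idx];                      // read from global memory
        f1[threadIdx.x] = f0[threadIdx.x] * 2.0f;       // (placeholder calculation)
        f2[threadIdx.x] = f1[threadIdx.x] + f0[threadIdx.x];
        out[idx] = f2[threadIdx.x];                     // write back to global memory
    }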

My GPU is a Tesla T10, with 16K registers and 16 KB of shared memory per SM.

When I time both versions, the first version is faster than the second one. I understand that the access time to shared memory can be slightly higher than the access time to registers; however, in the second version I can have 2 blocks assigned to an SM, whereas in the first version only one block is assigned to the SM.

My question is: since I can fit 2 blocks on an SM, the second version should be faster, yet the timings show this is not true. What is the explanation?

Thanks,

See what happens in the Occupancy calculator:

1st case: 256 threads, 15 registers, 0 smem - 100%

2nd case: 256 threads, 8 registers, 8000 smem - 50%

Probably explains what you see :)
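(Rough arithmetic behind those numbers, assuming the T10 limits of 16384 registers, 16 KB of shared memory and 1024 threads per SM:
1st case: 15 regs x 256 threads = 3840 registers per block, so 4 blocks fit on an SM -> 1024 threads -> 100%.
2nd case: 8000 bytes of smem per block, so only 2 blocks fit into 16 KB -> 512 threads -> 50%.)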

eyal

First of all, you can check occupancy using the profiler, but beyond a certain point (~192 threads) you are not always going to see much benefit from higher occupancy (192 threads are enough to hide the register read-after-write latency); it depends on how much latency there is left to hide.

It depends on your access patterns, but you may be seeing bank conflicts. Try putting the data in columns in shared memory instead of rows (i.e. store the first variable for all threads consecutively, then the second, and so on, which works out to a stride of 256 between a given thread's variables). This makes sure that all the data for each thread sits in a single bank and should remove the bank conflicts.
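Something along these lines (a rough sketch; the array name and the placeholder calculation are invented, 256 threads per block and 7 float temporaries assumed as in your description):

    #define NTHREADS 256
    #define NVARS    7

    __global__ void kernel_columns(const float *in, float *out)
    {
        // "Column" layout: variable index major, thread index minor.
        // Variable v of the current thread lives at tmp[v * NTHREADS + threadIdx.x].
        // Consecutive threads touch consecutive 32-bit words, i.e. consecutive banks,
        // so a half-warp never hits the same bank twice; and since the stride between
        // a single thread's variables (256) is a multiple of 16, they all share one bank.
        __shared__ float tmp[NVARS * NTHREADS];

        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        tmp[0 * NTHREADS + threadIdx.x] = in[idx];
        tmp[1 * NTHREADS + threadIdx.x] = tmp[0 * NTHREADS + threadIdx.x] * 0.5f;
        out[idx] = tmp[1 * NTHREADS + threadIdx.x];
    }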

Bank conflicts in shared memory are probably the last thing to check with regard to performance issues.

In the first case you are only using 15*256=3840 registers. There are 16K word-wide registers equating to 64KB of register memory, but only 16KB of shared memory.

Well then, I guess we do different things, but it is usually the most significant performance issue I see in my code, apart from global memory access patterns.

Thanks Tera, so each register is 4 bytes wide? If so, an integer or a float fits in one register, right?

Thanks Eyal, yes, it explains what I see. I was also confused into thinking that I needed 4 registers to store a float/int value, which is wrong; that's why I said I was using roughly 16K registers (4 bytes x 16 regs x 256 threads), when I was really using only 4K registers (16 regs x 256 threads).

Thanks. In my case the data in shared memory doesn't have any particular order, because it only stores temporary values. I only mapped those variables to shared memory to use fewer registers.

On a somewhat related note, in general how much faster is access to registers compared to shared memory for Fermis and non-Fermis?

Yes.

Which likely means there will be no bank conflicts: if you turn 4-byte automatic variables into arrays in shared memory indexed by the thread ID, access is free of bank conflicts.

I have no hard data, but based on anecdotal evidence consistent with my own measurements, shared memory adds 2 clock cycles if one operand comes from shared memory (6 instead of 4 cycles for most operations). If 2 operands come from shared memory, one is moved through a register, which adds another 6 cycles (12 in total). Writeback to shared memory appears to be free throughput-wise (latency-wise I guess it increases the latency from 22 to 24 cycles, which would be hard to measure).

All this is pre-Fermi, and mostly speculative. No liability assumed. :) I’ll try to dig up a reference.


I am trying to measure my kernel at the cycle level. How are you measuring your code? Could you give me some ideas, because I need to do the same?

Thanks

Actually, no

From what he says, all threads access the same element index within their own set of variables. The fact that he talked about 7 elements in shared memory, which is a prime number, suggests no bank conflicts. If it were 8 32-bit variables per thread stored consecutively in shared memory, and all threads accessed the same element (say, all threads access the first element in their own array), you would actually get 8-way bank conflicts.

What I would actually be interested in is how much of a difference the OP sees, and, looking at the Visual Profiler, whether there is any warp serialization taking place.

Also how many runs were taken, as run times are not consistent and the average is the interesting number.

Every variable mapped to a register has its own instance in every thread. As I said, I only mapped those variables to shared memory to use fewer registers. Therefore every thread accesses different elements in the array.

And that is where the problem lies with bank conflicts. Shared memory is split into 16 banks, so if two word addresses give the same number modulo 16, they sit in the same bank.

So if thread 1 accesses word 0 and thread 2 accesses word 16, you get a two-way bank conflict.

If you had 8 consecutive 32-bit values per thread in shared memory, and each thread accessed element 0 of its own array, what you would have gotten is

thread:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
element:  0   8  16  24  32  40  48  56  64  72  80  88  96 104 112 120
bank:     0   8   0   8   0   8   0   8   0   8   0   8   0   8   0   8

which produces 8-way bank conflicts (and in my experience this can mean a 2x or 3x performance difference, depending on arithmetic intensity and occupancy, and thus the ability to hide latency)

What saves you in this case is that you have 7 elements, which is a prime number, so the least common multiple with 16 is 16*7 and there are no bank conflicts (any prime apart from 2 assures no bank conflicts; any even number of elements assures that there are bank conflicts)
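(To spell it out: for 16 consecutive threads with a stride of 7 words, the banks hit are (7*t) mod 16 for t = 0..15, i.e. 0, 7, 14, 5, 12, 3, 10, 1, 8, 15, 6, 13, 4, 11, 2, 9 - every bank exactly once.)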

The only exception would be if all threads were reading the same value (not just the same bank, but the exact same address), in which case instead of a 16-way bank conflict you would get a broadcast.

You're thinking too complicated: your calculation would apply if the 8 variables per thread were put into a struct, and then an array of these structs indexed by the thread ID were created.

In the straightforward way, where each variable is separately turned into an array, no bank conflicts occur.
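In code, the two layouts would look roughly like this (a sketch with invented names, assuming 8 floats per thread and 256 threads per block):

    struct PerThread { float v[8]; };

    __global__ void layouts(float *out)
    {
        // Array of structs: each thread's 8 variables occupy 8 consecutive words,
        // so within a half-warp the accesses to v[0] are 8 words apart and only
        // ever touch banks 0 and 8 -> 8-way bank conflicts.
        __shared__ PerThread aos[256];
        aos[threadIdx.x].v[0] = (float)threadIdx.x;

        // One array per variable: consecutive threads read consecutive words,
        // i.e. 16 different banks per half-warp -> no bank conflicts.
        __shared__ float v0[256];
        v0[threadIdx.x] = aos[threadIdx.x].v[0];

        out[blockIdx.x * blockDim.x + threadIdx.x] = v0[threadIdx.x];
    }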

In fact each variable is turned into an array. Thanks tera and laughingrice for adding more ideas.