Optimized shared memory index mapping function

I am presently trying to improve performance using shared memory. I use smem only for performance, not for sharing data between threads within a block. I allocate 512 bytes as a uint8_t array, which gives me a 4x4-byte matrix for every bank. Every thread is only supposed to address one bank, in order to eliminate bank conflicts.

__shared__ u8 box[512];

I use the following function to map the 4x4 matrix to shared memory.

__device__ u32 expand_i(u32 i, u32 tid) {
//	return (i / 4) * 128 + (i % 4) + (tid * 4); // old code - slow!
	return ((i >> 2) << 7) + (i & 3) + (tid << 2); // improved code - is this the best possible?
}

The indices are supposed to map as follows (i -> smem offset, for tid = 0):

0 -> 0
1 -> 1
2 -> 2
3 -> 3
4 -> 128
8 -> 256

All above is pretty standard.

However Nsight Compute shows:

  • Compute Throughput 23%
  • Memory Bandwidth 3%
  • Stall Wait 13%

Basically every section flagged with exclamation marks directs me to Source Counters, where I see a BRA instruction as the major stall factor inside expand_i().

Presently, I don’t see how I could further optimize expand_i().

Besides that, I’m not sure if I read/interpret the Nsight Compute output correctly, and whether expand_i() really is the culprit for the low compute throughput.

Suppose you have:

int my_index = expand_i(0, threadIdx.x);
unsigned char my_value = box[my_index];

That will result in thread 0 reading byte 0 (which happens to be in bank 0) and thread 1 reading byte 4 (which happens to be in bank 1) and thread 2 reading byte 8 (which happens to be in bank 2), etc.

That is good as far as it goes - no bank conflicts. But it is still inefficient use of the LD/ST unit. You should aim for 128 bytes used per warp-wide transaction, not 32.

Now, obviously this is going to depend on your code structure. But suppose each thread needs to access the elements of its “box” in sequence. Like this:

int my_index = expand_i(0, threadIdx.x);
unsigned char my_value = box[my_index];
// do something with my_value/box element 0
my_index = expand_i(1, threadIdx.x);
my_value = box[my_index];
// do something with my_value/box element 1
my_index = expand_i(2, threadIdx.x);
my_value = box[my_index];
// do something with my_value/box element 2
my_index = expand_i(3, threadIdx.x);
my_value = box[my_index];
// do something with my_value/box element 3

A more efficient approach might be:

int my_index = expand_i(0, threadIdx.x);
my_index  >>= 2;
uchar4 my_value = ((uchar4 *)box)[my_index];
// do something with my_value.x/box element 0
// do something with my_value.y/box element 1
// do something with my_value.z/box element 2
// do something with my_value.w/box element 3

This requires proper alignment of the base address for box. I believe if you declare it as shown, it will be, but if you want to make that explicit, then declare it as:

__shared__ uchar4 box[128];

and if you do that, then it probably makes sense to refactor your expand function to handle the arithmetic correctly rather than this:

my_index  >>= 2;

Whether or not any of this will help with code performance, I’m not sure. This item:

suggests it may not help, but that info is for global memory, not shared memory. You haven’t indicated any of the data that would be relevant for improving shared access, from what I can see.

An application that has both low compute utilization and low memory utilization is often latency bound. Such an application may not be exposing enough parallelism to take advantage of the GPU, and in that case doing compute optimizations or memory optimizations often won’t help with performance. Studying your expand/indexing function, it makes it look like you are only using 32 threads per block, which is undoubtedly a poor usage pattern for GPU code.

However, since you asked about improving the expand function, the type of refactoring I suggest is likely to allow a simplification of that function. Something like:

__device__ u32 expand_i(u32 i, u32 tid) { return tid + 32 * i;}

(but none of what you have in that function now looks very “expensive” to me, and I can’t for the life of me figure out why a BRA instruction would appear in the midst of that code you have shown. I suspect that something in the description doesn’t add up.)

Yes, that’s correct!

Every box is populated by every thread with different random numbers (taken modulo 16).

const u8 rnd[16] = { 4, 6, 1, 14, ... }; // random data % 16

for (i = 0; i < 16; i++) {
        box[expand_i(i, threadIdx.x)] = rnd[i];
}

If box elements were read sequentially, your code modification would indeed transfer 4x more data, using the LD/ST unit efficiently. That’s also what I could read in the Nsight Compute output (L1TEX, L2 Load). However, every byte read from the box into my_value modifies the my_index of the next read, which might not fall within the same aligned 4 bytes that could be fetched in one warp-wide transaction. Hence, the (1-byte) read accesses are shuffled randomly across the 4x4 matrix. All this makes the data reads completely random. I believe this is the biggest challenge in this project.

Global memory accesses are limited to 1 access per thread, so I believe they don’t impact overall performance a lot.

I use only 32 threads because a later extension of this project might end up with a box matrix of 4x192 bytes x 32 threads per block = 24576 bytes, which is half of the shared memory (48KB). Nevertheless, after reading your post, I increased the threads per block gradually from 32 up to 256 (and changed the code accordingly), but I didn’t see more than a 5% performance increase. There must be something else.

I’ll take a closer look into this BRA instruction. I believe I miss something.

Still trying to improve my lousy performance.

While focusing on the shared memory byte array mapping function (which is not the main issue regarding performance), I tried the following:

for (i = 0; i < 256; i++) {
//	s32[map_i(i, tid)] = i; // 23 mops/s
//	s32[map_i_asm(i, tid)] = i; // 23 mops/s
//	s32[map_i_macro(i, tid)] = i; // 26 mops/s
	s32[((i >> 2) << 7) + (i & 3) + (tid << 2)] = i; // 26 mops/s
}

__device__ u32 map_i(u32 i, u32 tid) {
	return ((i >> 2) << 7) + (i & 3) + (tid << 2);
}

__device__ u32 map_i_asm(u32 i, u32 tid) {
    u32 i_s32;
    asm(".reg .u32 t0, t1, t2, t3, t4;" // asm code far from optimal (my first attempt)
        "shr.b32 t0, %1, 2;" // i >> 2
        "shl.b32 t1, t0, 7;" // t0 << 7
        "shl.b32 t2, %2, 2;" // tid << 2
        "and.b32 t3, %1, 3;" // i & 3
        "add.u32 t4, t1, t2;" // add step 1
        "add.u32 %0, t4, t3;" // add step 2
        : "=r"(i_s32) : "r"(i), "r"(tid));
    return i_s32;
}

#define map_i_macro(i, tid) ((((i) >> 2) << 7) + ((i) & 3) + ((tid) << 2))

The map_i() and map_i_asm() functions come up with exactly the same performance (23 mops/s), while the macro and the “direct” coding (without function call or macro) are slightly faster (26 mops/s).

Two questions:

  1. I was surprised to see that map_i() is somewhat slower than the implementation without function/macro call. How come, since it was inlined? I don’t see the reason when looking at the SASS.

  2. Looking at the SASS, I quite often see MOV Rx, Rx throughout the entire code. This should not do anything. Why is that?

(1) Are you sure you are looking at the code generated from a release build with full optimization?

(2) Both machine instructions and the ABI may force the use of specific registers, and if regular register allocation does not line up with that, there will be an occasional register-to-register MOV. For example, instructions consuming or producing 64-bit data require that data to reside in aligned register pairs, i.e. with the less significant bits in an even-numbered register.

Without quantifying “quite often” it is impossible to say whether there is something to be concerned about. Much in the world of compilers is driven by heuristics to make problems with NP complexity solvable in reasonable (polynomial instead of exponential) time; this leads to good results on average but there are always that few percent of programs for which this doesn’t work out so well.

Compiling for sm_75, I see no difference in the SASS generated for “old code” and “improved code”. Are you sure you are looking at the SASS generated by a release build?

Hi Geohei,

How many times do you call expand_i() per thread and also how many blocks are you launching?

And kernel duration might be useful also - just trying to get some sort of overview.

That’s it. Shame on me. It was the Debug code I looked at. Now I get 81% on both Compute and Memory (global - very little used though) throughput. Basically I could be happy, but I believe there’s more to take out.

Thanks for the information (I didn’t know), but I believe your explanations apply solely to the Debug binary. The Release binary doesn’t show these repeating MOV Rx, Rx patterns. By “often”, I meant about 5% of all code instructions. What I still don’t get is why moving from register x to the same register x is done at all (it doesn’t do anything).


Correct - the difference only showed in Debug code.

expand_i() (I renamed it to map_i()) is called about 512 times per thread (subject to 0-5% +variation based on result). I get about 600 million thread calls per second (CC 7.5).

Without context it is impossible to say whether that serves some functional purpose (e.g. no-op padding for alignment) or is simply a – presumably rare – optimization artifact (e.g. caused by phase ordering).

Is the kernel described below more or less at its maximum possible performance?

Basically, it’s a 4x4 and 16x16 byte (uint8_t) 2D matrix array, read/write accessed non-sequentially. The matrix is located in shared memory (for performance only - no data exchange between threads) and accessed 100% free of bank conflicts (see Q2 below). I use 32 threads per block (due to the present code setup). 64 threads and above would deteriorate performance due to bank conflicts (see Q1 below).

Nsight Compute tells me:

  • Compute throughput : 81%
  • Memory throughput : 81% (very little global mem access)
  • LSU pipe utilization : 80% (like Robert Crovella assumed earlier - confirmed now)
  • The memory chart shows 38% peak usage for r/w shared memory. I don’t know how I could possibly improve this. I believe the array index mapping formula shown earlier is responsible for the missing %:
    ((i >> 2) << 7) + (i & 3) + (threadIdx.x << 2)
  • Stall Short Scoreboard : 3.3. The top 5 source counters show IADD3.

I have no code similar to mine to compare performance with, but I have the “feeling” there’s more to draw out of the GPU, especially when I read things like:

Occupancy Limiters: This kernel’s theoretical occupancy (25.0%) is limited by the required amount of shared memory.

I decided on 32 threads per block to remain shared memory conflict free.

Q1: Is it correct that e.g. 64 threads per block would, by design, necessarily generate bank conflicts (RTX 2080 SUPER)?

Shared memory wise, I fill only ~22% of the shared memory. In this respect, there would be enough room for 4x more shared memory usage.

Q2: Nsight Compute shows about 2% shared memory bank conflicts if I exceed 256 blocks (for testing only). This number changes a bit for every test I run. Where do these bank conflicts come from, since the kernel is conflict free?

No. There is no such thing as a bank conflict when considering two different instructions, or two different warps.


I am not following this logic. This indexing arithmetic represents just a few ALU operations and should not be performance limiting at all. You can double check the generated SASS, but I would expect:

(1) threadIdx.x << 2 gets pre-computed once in the kernel, as threadIdx.x does not change for the duration of the kernel (assuming reasonable register pressure doesn’t prevent allocating a temp register to hold this value).

(2) ((i >> 2) << 7) gets mapped to ((i & ~3) << 5) to save one shift operation, as shifts are generally more “expensive” than simple ALU operations

(3) the three parts of the expression are added together with IADD3. Alternatively the compiler may choose an LEA-type instruction that combines shift and add operations.

The discussion around residual bank conflicts here may apply.

Also, is your kernel solely based on byte operations and if so, is it logic only or does it have significant arithmetic content?

It’s a bit difficult when we can’t see the code, but here are some observations, as I recently finished something similar in the sense of a byte based kernel, using frequent inner loop, conflict free random read access to a uint32_t table[256]. For what it’s worth, the same kernel, when accessing the table with conflicts and using uint16_t values, gained about 50% in performance switching to conflict free, but we’re probably comparing apples with bananas here.

In the absence of code, given you state above that map_i() is called 512 times, I’m assuming you are in an inner loop. So given no global memory access, the number and type of instructions in the loop will directly impact peak shared memory access. If you are efficiently achieving what you need to in the loop, then the lowish shared access figure is a side effect. For context, my kernel on SM6.1 has an inner loop of 19 instructions (4 x LD.SHARED, 4 x DP4A and the rest LOP), and the peak shared usage is around 80%.

The reason I asked above “is your kernel solely based on byte operations and if so, is it logic only or does it have significant arithmetic content?”, is if the answer is yes to the former, then you can maybe gain a significant amount by rewriting the kernel in SWAR or “bytesliced” form, so that each thread is producing 4 results. Again, this was a step I took and it also improved performance around 50%.

This does not necessarily indicate a problem. My kernel also shows Compute and Memory throughput around 80% and shared memory usage (just due to the conflict free table), limits me to only having 1 block of 768 threads per SM, for best performance. Taking note of the Occupancy graphs in Nsight Compute was useful here.

Edited: To remove the section on conflicts, as given you say you don’t see any, you’re presumably using an adjusted version of map_i().

This is the SASS. Indeed no bottleneck.

((i >> 2) << 7) + (i & 3) + (threadIdx.x << 2)

LOP3.LUT R13, R7, 0x3, RZ, 0xc0, !PT 
IMAD.SHL.U32 R11, R7, 0x20, RZ 
IMAD R12, R4, 0x4, R13 
LOP3.LUT R11, R11, 0x1f80, RZ, 0xc0, !PT 

Interesting. Thanks!

Yes, the payload data is completely based on bytes. The box[] indices are u32 though.

The Executed Instruction Mix shows IMAD, STS, LOP3, IADD3, LDS, … in decreasing order.

I checked again. I can’t publish the code due to policy (#&%§*).

Very good point. So basically gaining by multiplying resources and accepting losses due to bank conflicts - the overall balance is a gain. I translated your idea to my code, though 1:1 is not possible.


// 1x  4x64 matrix ;  8KB smem ;  32 thread blocks ; 608 mops/s ; 100% smem usage
return (i / 4) * 128 + (i % 4) + (tid * 4); // 1
//       no matrix ;  8KB smem ;  32 thread blocks ; 436 mops/s ; 100% smem usage ; BANK CONFLICTS!
return (i * 32) + tid; // 2
//       no matrix ; 16KB smem ;  64 thread blocks ; 380 mops/s ; 100% smem usage ; BANK CONFLICTS!
return (i * 64) + tid; // 3
// 4x 1x256 matrix ; 32KB smem ; 128 thread blocks ; 355 mops/s ; 100% smem usage ; BANK CONFLICTS!
return (i * 128) + tid; // 4

The first return is my maximum achieved performance (just for reference). Each thread stays in a dedicated bank. No conflicts. More space utilization is not possible since I already use all 32 banks of shared memory. Considering the entire project, I’m limited by shared memory space, since every block runs 32 threads using 8192 bytes. This seems to be the optimum, since box[] (256 bytes) sits in a 4x64-byte “virtual” box of shared memory, and this goes for 32 threads (4 * 64 * 32 = 8192 bytes). So theoretically, 8 warps per SM on 48 SMs (RTX 2080 SUPER) = 384 blocks are executed concurrently on the GPU. Is that correct?

The other 3 returns just spread the box[] in 32, 64 or 128 byte steps sequentially into shared memory, thus creating bank conflicts and less performance.

I did this in a test project 2 years ago. However, that was bitsliced (even a bit more complicated) and used registers. Huge gain in performance indeed. However, in the subject example, bytes are r/w accessed non-sequentially and in a different pattern for every thread, so byte-slicing won’t work.

I may have confused the point I was making here, by mentioning I changed from uint16_t to uint32_t. All I was really saying was that switching from a heavily bank conflicted situation, to conflict free, was a large gain.


You should be able to remain conflict free for block sizes larger than 32, by modifying map_i() to:

((i / 4) * 128) + ((threadIdx.x / 32) * 8192) + (i % 4) + ((threadIdx.x % 32) * 4);

So thread 32 maps back to Bank 0, which is not a conflict with thread 0, because it’s in a different warp.


After lots of testing, I think I can’t get more out of my kernel. The only remaining option would be registers. A 256-byte array in registers isn’t possible if the kernel doesn’t know the indices at compile time (which is the case for the subject kernel), so I tried to implement a device function which “simulates” an array.

__device__ u8 s256(uint8_t i, uint16_t w = 999) {
    static uint32_t d00 = 0x52f32ef1;
    // ...
    static uint32_t d63 = 0x5d19ba40;
    if (w == 999) { // read access
        switch (i) {
            case   0: return (d00 >>  0) & 0xff;
            // ...
            case 255: return (d63 >> 24) & 0xff;
        }
    } else { // write access
        switch (i) {
            case   0: d00 = (d00 & 0xffffff00) ^ (w <<  0); break;
            // ...
            case 255: d63 = (d63 & 0x00ffffff) ^ (w << 24); break;
        }
    }
    return 0;
}

s256(i) reads data from index i.
s256(i, 0xee) writes data to index i.
The array initializes only once due to the static qualifier.
My hope was that nvcc would put d00-d63 in registers, which didn’t happen.
Kernel performance was more than bad. I didn’t investigate why.
Thanks for all the comments above. All very helpful.