Nsight compute seems to be giving the wrong number of bank conflicts

I have some code that could be simplified like this:

#include <stdint.h>

__device__ __forceinline__ static uint laneId()
{
    uint ret;
    asm("mov.u32 %0, %%laneid;" : "=r"(ret));
    return ret;
}

__device__ __forceinline__ static uint warpId()
{
    const uint tid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    return tid / 32;
}

template <typename T, typename S>
__device__ inline void store_as(S *p, const T& x)
{
    *reinterpret_cast<T *>(p) = x;
}

template <unsigned int N>
__device__ inline void store_uint16_vector(uint16_t *dest, const uint32_t *ptr);

template <>
__device__ inline void store_uint16_vector<1u>(uint16_t *dest, const uint32_t *ptr)
{
    dest[0] = static_cast<uint16_t>(ptr[0]);
}

template <>
__device__ inline void store_uint16_vector<2u>(uint16_t *dest, const uint32_t *ptr)
{
    ushort2 uint16x2;
    uint16x2.x = static_cast<uint16_t>(ptr[0]);
    uint16x2.y = static_cast<uint16_t>(ptr[1]);
    store_as<ushort2>(dest, uint16x2);
}

template <>
__device__ inline void store_uint16_vector<4u>(uint16_t *dest, const uint32_t *ptr)
{
    ushort4 uint16x4;
    uint16x4.x = static_cast<uint16_t>(ptr[0]);
    uint16x4.y = static_cast<uint16_t>(ptr[1]);
    uint16x4.z = static_cast<uint16_t>(ptr[2]);
    uint16x4.w = static_cast<uint16_t>(ptr[3]);
    store_as<ushort4>(dest, uint16x4);
}

template <>
__device__ inline void store_uint16_vector<8u>(uint16_t *dest, const uint32_t *ptr)
{
    uint4 uint32x4;
    store_uint16_vector<2u>(reinterpret_cast<uint16_t *>(&uint32x4.x), &ptr[0]);
    store_uint16_vector<2u>(reinterpret_cast<uint16_t *>(&uint32x4.y), &ptr[2]);
    store_uint16_vector<2u>(reinterpret_cast<uint16_t *>(&uint32x4.z), &ptr[4]);
    store_uint16_vector<2u>(reinterpret_cast<uint16_t *>(&uint32x4.w), &ptr[6]);
    store_as<uint4>(dest, uint32x4);
}

__global__ void kernel()
{
    const unsigned int warpid = warpId();
    const unsigned int laneid = laneId();

	__shared__ uint16_t smem[8][4][128 + 8];

    const unsigned int k = laneid * 16;
    const unsigned int k_hi = k / 128;
    const unsigned int k_lo = k % 128;
    uint32_t sum[16];
    store_uint16_vector<8u>(&smem[warpid][k_hi][k_lo], sum);
    __syncwarp();
}

The store function is translated into a single STS.128 instruction. Without the +8 padding, one would expect 8-way bank conflicts, which is what nsight compute showed initially. However, after the padding I would expect 4-way bank conflicts:

Lane row col offset (in multiples of 32b) base_bank banks touched
0 0 0 0 0 {0,1,2,3}
1 0 1 8 8 {8,9,10,11}
2 0 2 16 16 {16,17,18,19}
3 0 3 24 24 {24,25,26,27}
4 0 4 32 0 {0,1,2,3}
5 0 5 40 8 {8,9,10,11}
6 0 6 48 16 {16,17,18,19}
7 0 7 56 24 {24,25,26,27}
8 1 0 68 4 {4,5,6,7}
9 1 1 76 12 {12,13,14,15}
10 1 2 84 20 {20,21,22,23}
11 1 3 92 28 {28,29,30,31}
12 1 4 100 4 {4,5,6,7}
13 1 5 108 12 {12,13,14,15}
14 1 6 116 20 {20,21,22,23}
15 1 7 124 28 {28,29,30,31}
16 2 0 136 8 {8,9,10,11}
17 2 1 144 16 {16,17,18,19}
18 2 2 152 24 {24,25,26,27}
19 2 3 160 0 {0,1,2,3}
20 2 4 168 8 {8,9,10,11}
21 2 5 176 16 {16,17,18,19}
22 2 6 184 24 {24,25,26,27}
23 2 7 192 0 {0,1,2,3}
24 3 0 204 12 {12,13,14,15}
25 3 1 212 20 {20,21,22,23}
26 3 2 220 28 {28,29,30,31}
27 3 3 228 4 {4,5,6,7}
28 3 4 236 12 {12,13,14,15}
29 3 5 244 20 {20,21,22,23}
30 3 6 252 28 {28,29,30,31}
31 3 7 260 4 {4,5,6,7}

In bank 0 only lanes 0, 4, 19 and 23 are colliding, so 4-way is what I thought was the bank conflict here. However, when running and profiling inside nsight compute I still got 8-way, which seems wrong, but the timing was maintained so maybe it is correct. The problem is that, if it is indeed 8-way, why do my computations revealed 4-way? What am I missing? Is the STS.128 instructions doing something under the hood that I do not understand? Is my analysis wrong? I would like to understand bank conflicts properly in this case.

Given 32-bit predicated true threads the hardware will break-up large STS into N-groups of 32/N lanes. For STS.128 this is usually

wavefront 0 lanes 0-7
wavefront 1 lanes 7-15
wavefront 2 lanes 16-23
wavefront 3 lanes 24-31

If any full groups is predicated off it will be skipped. The ideal case for consecutive addresses with no padding is 4 sectors.

In the table above there is a bank conflict in all 4 of the wavefronts due to the additional 16 byte padding so the instruction will take 8 wavefronts of which 4 are bank conflicts.

Your answer is not clear enough to me. What is a wavefront? What is a predicated true thread? And if you say there are 4 bank conflicts why do I see 8-way bank conflict? Moreover, why is padding doing nothing?

Predicated true threads are threads for which the instruction is issued. Threads might be predicated off if there is warp-divergence, for example.

You can think of wavefronts as shared memory hardware transactions of size up to 128 bytes.

The warp-wide STS.128 will write 512 bytes, which have to be split into 128-byte transactions. That is 8 threads with 16 bytes per transaction, and Greg states that the groups of 8 threads are not arbitrary, but simply consecutive 8 threads.

Bank conflicts occur within a 128-byte transaction. Greg has explained the inner workings of shared memory in older postings, such as here: Requesting clarification for Shared Memory Bank Conflicts and Shared memory access? - #12 by Greg

In your case, there is a 2-way conflict in each of the four thread groups, which leads to 8 hardware transactions instead of 4.

Okay, thank you. But how do I optimize the code to avoid having such bank conflicts?

Is it on purpose that you store with gaps (not talking about the padding)?

Each thread writes 8 short = 128 bits = 16 bytes.

But you multiply the index by 16. 16 short = 32 bytes.

It is possible to access shared memory that way, just wanted to make sure, it was on purpose.





According to greg’s post, groups of 8 threads have to use all banks. 8 threads x 4 banks/thread.

According to your table none of your lanes 0..7 access banks 4-7, etc.

Is the data, which each thread processes, fixed? Is the order of the threads fixed? Is the data layout in shared memory fixed?

Try

smem[8][8][64 + 8];

and

    const unsigned int k_hi = k / 64;
    const unsigned int k_lo = k % 64;