Nsight compute seems to be giving the wrong number of bank conflicts

jose.perez.cano · March 10, 2026, 11:21am

I have some code that could be simplified like this:

#include <stdint.h>

__device__ __forceinline__ static uint laneId()
{
    uint ret;
    asm("mov.u32 %0, %%laneid;" : "=r"(ret));
    return ret;
}

__device__ __forceinline__ static uint warpId()
{
    const uint tid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    return tid / 32;
}

template <typename T, typename S>
__device__ inline void store_as(S *p, const T& x)
{
    *reinterpret_cast<T *>(p) = x;
}

template <unsigned int N>
__device__ inline void store_uint16_vector(uint16_t *dest, const uint32_t *ptr);

template <>
__device__ inline void store_uint16_vector<1u>(uint16_t *dest, const uint32_t *ptr)
{
    dest[0] = static_cast<uint16_t>(ptr[0]);
}

template <>
__device__ inline void store_uint16_vector<2u>(uint16_t *dest, const uint32_t *ptr)
{
    ushort2 uint16x2;
    uint16x2.x = static_cast<uint16_t>(ptr[0]);
    uint16x2.y = static_cast<uint16_t>(ptr[1]);
    store_as<ushort2>(dest, uint16x2);
}

template <>
__device__ inline void store_uint16_vector<4u>(uint16_t *dest, const uint32_t *ptr)
{
    ushort4 uint16x4;
    uint16x4.x = static_cast<uint16_t>(ptr[0]);
    uint16x4.y = static_cast<uint16_t>(ptr[1]);
    uint16x4.z = static_cast<uint16_t>(ptr[2]);
    uint16x4.w = static_cast<uint16_t>(ptr[3]);
    store_as<ushort4>(dest, uint16x4);
}

template <>
__device__ inline void store_uint16_vector<8u>(uint16_t *dest, const uint32_t *ptr)
{
    uint4 uint32x4;
    store_uint16_vector<2u>(reinterpret_cast<uint16_t *>(&uint32x4.x), &ptr[0]);
    store_uint16_vector<2u>(reinterpret_cast<uint16_t *>(&uint32x4.y), &ptr[2]);
    store_uint16_vector<2u>(reinterpret_cast<uint16_t *>(&uint32x4.z), &ptr[4]);
    store_uint16_vector<2u>(reinterpret_cast<uint16_t *>(&uint32x4.w), &ptr[6]);
    store_as<uint4>(dest, uint32x4);
}

__global__ void kernel()
{
    const unsigned int warpid = warpId();
    const unsigned int laneid = laneId();

	__shared__ uint16_t smem[8][4][128 + 8];

    const unsigned int k = laneid * 16;
    const unsigned int k_hi = k / 128;
    const unsigned int k_lo = k % 128;
    uint32_t sum[16];
    store_uint16_vector<8u>(&smem[warpid][k_hi][k_lo], sum);
    __syncwarp();
}

The store function is translated into a single STS.128 instruction. Without the +8 padding, one would expect 8-way bank conflicts, which is what nsight compute showed initially. However, after the padding I would expect 4-way bank conflicts:

Lane	row	col	offset (in multiples of 32b)	base_bank	banks touched
0	0	0	0	0	{0,1,2,3}
1	0	1	8	8	{8,9,10,11}
2	0	2	16	16	{16,17,18,19}
3	0	3	24	24	{24,25,26,27}
4	0	4	32	0	{0,1,2,3}
5	0	5	40	8	{8,9,10,11}
6	0	6	48	16	{16,17,18,19}
7	0	7	56	24	{24,25,26,27}
8	1	0	68	4	{4,5,6,7}
9	1	1	76	12	{12,13,14,15}
10	1	2	84	20	{20,21,22,23}
11	1	3	92	28	{28,29,30,31}
12	1	4	100	4	{4,5,6,7}
13	1	5	108	12	{12,13,14,15}
14	1	6	116	20	{20,21,22,23}
15	1	7	124	28	{28,29,30,31}
16	2	0	136	8	{8,9,10,11}
17	2	1	144	16	{16,17,18,19}
18	2	2	152	24	{24,25,26,27}
19	2	3	160	0	{0,1,2,3}
20	2	4	168	8	{8,9,10,11}
21	2	5	176	16	{16,17,18,19}
22	2	6	184	24	{24,25,26,27}
23	2	7	192	0	{0,1,2,3}
24	3	0	204	12	{12,13,14,15}
25	3	1	212	20	{20,21,22,23}
26	3	2	220	28	{28,29,30,31}
27	3	3	228	4	{4,5,6,7}
28	3	4	236	12	{12,13,14,15}
29	3	5	244	20	{20,21,22,23}
30	3	6	252	28	{28,29,30,31}
31	3	7	260	4	{4,5,6,7}

In bank 0 only lanes 0, 4, 19 and 23 are colliding, so 4-way is what I thought was the bank conflict here. However, when running and profiling inside nsight compute I still got 8-way, which seems wrong, but the timing was maintained so maybe it is correct. The problem is that, if it is indeed 8-way, why do my computations revealed 4-way? What am I missing? Is the STS.128 instructions doing something under the hood that I do not understand? Is my analysis wrong? I would like to understand bank conflicts properly in this case.

Greg · March 11, 2026, 6:44pm

Given 32-bit predicated true threads the hardware will break-up large STS into N-groups of 32/N lanes. For STS.128 this is usually

wavefront 0 lanes 0-7
wavefront 1 lanes 7-15
wavefront 2 lanes 16-23
wavefront 3 lanes 24-31

If any full groups is predicated off it will be skipped. The ideal case for consecutive addresses with no padding is 4 sectors.

In the table above there is a bank conflict in all 4 of the wavefronts due to the additional 16 byte padding so the instruction will take 8 wavefronts of which 4 are bank conflicts.

jose.perez.cano · March 12, 2026, 9:47am

Your answer is not clear enough to me. What is a wavefront? What is a predicated true thread? And if you say there are 4 bank conflicts why do I see 8-way bank conflict? Moreover, why is padding doing nothing?

striker159 · March 12, 2026, 11:43am

Predicated true threads are threads for which the instruction is issued. Threads might be predicated off if there is warp-divergence, for example.

You can think of wavefronts as shared memory hardware transactions of size up to 128 bytes.

The warp-wide STS.128 will write 512 bytes, which have to be split into 128-byte transactions. That is 8 threads with 16 bytes per transaction, and Greg states that the groups of 8 threads are not arbitrary, but simply consecutive 8 threads.

Bank conflicts occur within a 128-byte transaction. Greg has explained the inner workings of shared memory in older postings, such as here: Requesting clarification for Shared Memory Bank Conflicts and Shared memory access? - #12 by Greg

In your case, there is a 2-way conflict in each of the four thread groups, which leads to 8 hardware transactions instead of 4.

jose.perez.cano · April 6, 2026, 2:20pm

Okay, thank you. But how do I optimize the code to avoid having such bank conflicts?

Curefab · April 6, 2026, 5:05pm

Is it on purpose that you store with gaps (not talking about the padding)?

Each thread writes 8 short = 128 bits = 16 bytes.

But you multiply the index by 16. 16 short = 32 bytes.

It is possible to access shared memory that way, just wanted to make sure, it was on purpose.

According to greg’s post, groups of 8 threads have to use all banks. 8 threads x 4 banks/thread.

According to your table none of your lanes 0..7 access banks 4-7, etc.

Is the data, which each thread processes, fixed? Is the order of the threads fixed? Is the data layout in shared memory fixed?

Try

smem[8][8][64 + 8];

and

    const unsigned int k_hi = k / 64;
    const unsigned int k_lo = k % 64;

Topic		Replies	Views
Does this code cause bank conflicts? Nsight Compute cuda , kernel	4	1424	September 6, 2024
How to understand the bank conflict of shared_mem CUDA Programming and Performance	16	16332	November 19, 2025
Shared memory bank conflict CUDA Programming and Performance	4	623	July 30, 2025
Problems about Profiling Shared Memory Bank Conflicts using nsight-compute Nsight Compute	2	1741	January 25, 2022
Problem about bank conflict test CUDA Programming and Performance	5	707	March 12, 2024
Questions about "L1 Conflicts Shared N-way" & metrics related to "Excessive" CUDA Programming and Performance	6	729	July 1, 2025
128-bit access bank conflict CUDA Programming and Performance	11	1396	March 29, 2024
Ncu detects bank conflicts in matrix transposition after padding Nsight Compute cuda	5	1535	January 30, 2023
Is there any way to find out the location in cuda code that cause shared memory bank conflicts? CUDA Programming and Performance	6	1311	January 21, 2022
The increase of the shared memory size leads to the bankconflict (from 9 KB shared memory) Nsight Compute	5	647	July 14, 2023

Nsight compute seems to be giving the wrong number of bank conflicts

Related topics