How to understand the bank conflict of shared_mem

Hello, NV’s experts
I have a problem about the bank conflict of shared_mem.
I know that the shared_mem is divided into 32 banks, and 4bytes per bank.
GPU will raise bank conflict if different threads(in the same warp) access different addresses in a bank.
I do some test to check it.
test1, as following:

my shared_memory is 4 x 32,my threads are divided into 4 x 8, each thread store 4 floats(16bytes),there is not any bank conflicts。
I think it cannot match above theory, obviously, T0, T8, T16, T24 are in the same bank。other threads are similar with them, why GPU didn’t raise bank conflict?

test2, as following:

my shared_memory is 4 x 36,my threads are still divided into 4 x 8, each thread store 4 floats(16bytes),there is not any bank conflicts.

test3, as following:

my shared_memory is still 4 x 36,my threads are still divided into 4 x 8, but threads are transposed, each thread still store 4 floats(16bytes),I found bank conflict through NCU. why it raise bank conflict?

I’m confused with above 3 tests, how to explain them?
when GPU raise bank conflict and when won’t raise bank conflict?

When you store (or load) more than 4 bytes per thread, which is like saying more than 128 bytes per warp, the GPU does not issue a single transaction. The largest transaction size is 128 bytes. If you request 16 bytes per thread, then warp wide that will be a total of 512 bytes per request (warp-wide). The GPU will break that up into 4 transactions (in that case: T0-T7 make up a transaction, T8-T15 are a transaction, and so on), each of which is 128 bytes wide. The determination of bank conflicts is made per transaction, not per request or per warp or per instruction.

The second case is identical to the first in this respect. Considering just the threads 0 to 7, or just the threads 8-15, and the transaction associated with each, there is no bank conflict.

In the 3rd case, the request across the warp will be broken up the same way: threads 0-7 will constitute one transaction. And when we look at the activity for those threads, we see that for example threads 0-3 are writing to the same column(s). So we expect 4-way bank conflicts there.


thanks Robert, so kindly
I want to figure out the detail about shared_mem bank conflict, and try to find method to avoid it.
my application is: I want to transpose the existing registers into shared memory. like this:

my warp is divided into 8x4, and each thread hold 4 float4(s), or say 4x4 floats,or say 16 register per thread。
I want to transpose those registers into shared memory, but, I found I cannot avoid the bank conflict of shared_mem if I want to apply stsmem128。
Is there any other method to implement my application without bank conflict?
Would you like to teach me?

You could take a look at cub::WarpStore CUB: WarpStore< T, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH > Class Template Reference

hello, striker:
so cool name !
thanks for you suggestion at first
I found it cannot help my application.
the cub’s warp_transpose is different from my expectation
let me check its code, as following:

    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
        thread_itr[(ITEM * BLOCK_THREADS)] = items[ITEM];

maybe, I am is wrong.
So, is there any other advise?

Some additional resources that may be of interest: 1 2 3

thank you, it match my expectation