Hello NVIDIA experts,
I have a question about shared memory bank conflicts.
I know that shared memory is divided into 32 banks, each 4 bytes wide.
The GPU reports a bank conflict when different threads in the same warp access different addresses in the same bank.
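To make the rule above concrete, here is a minimal host-side C++ sketch of it. It is only a model of the 32-bank, 4-byte-per-bank rule as stated in this post, not a description of what the hardware really does; the names `bank_of` and `conflict_degree` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Bank holding a given shared-memory byte address (32 banks, 4 bytes each).
int bank_of(std::uint32_t byte_addr) { return (byte_addr / 4) % 32; }

// Conflict degree of one group of accesses: the largest number of DISTINCT
// 32-bit words that fall into the same bank. 1 means conflict-free.
// Threads reading the same word do not conflict, so words are deduplicated.
int conflict_degree(const std::vector<std::uint32_t>& byte_addrs) {
    std::vector<std::set<std::uint32_t>> words_in_bank(32);
    for (std::uint32_t a : byte_addrs) {
        assert(a % 4 == 0);  // this model assumes 4-byte aligned accesses
        words_in_bank[bank_of(a)].insert(a / 4);
    }
    int worst = 0;
    for (const auto& words : words_in_bank)
        worst = std::max(worst, static_cast<int>(words.size()));
    return worst;
}
```

For example, 32 threads reading consecutive floats (byte addresses 0, 4, 8, ...) give a degree of 1, while a stride of 2 floats (byte addresses 0, 8, 16, ...) gives a degree of 2.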
I ran some tests to check this.
Test 1:
My shared memory is 4 x 32, my threads are arranged as 4 x 8, and each thread stores 4 floats (16 bytes). There are no bank conflicts.
I don't think this matches the theory above: T0, T8, T16, and T24 are obviously in the same bank, and the other threads are similar. Why didn't the GPU report a bank conflict?
My shared memory is still 4 x 36, my threads are still arranged as 4 x 8, but the threads are transposed. Each thread still stores 4 floats (16 bytes). NCU reports bank conflicts. Why does this case cause bank conflicts?
I'm confused by the 3 tests above. How can they be explained?
When does the GPU report a bank conflict, and when doesn't it?
When you store (or load) more than 4 bytes per thread, which is the same as saying more than 128 bytes per warp, the GPU does not issue a single transaction. The largest transaction size is 128 bytes. If you request 16 bytes per thread, that is a total of 512 bytes for the warp-wide request. The GPU will break that up into 4 transactions (in this case T0-T7 make up one transaction, T8-T15 another, and so on), each of which is 128 bytes wide. Bank conflicts are determined per transaction, not per request, per warp, or per instruction.
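The per-transaction rule can be checked with a small host-side C++ model. This is a sketch under assumptions: the original pictures are not available, so the test-1 layout in `first_word_test1` (thread t starts at row t/8, column (t%8)*4 of a 4 x 32 float array) is a guess, and both function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// ASSUMED test-1 layout: first float index written by thread t in a
// 4 x 32 float array; each thread stores 4 consecutive floats (16 bytes).
int first_word_test1(int t) { return (t / 8) * 32 + (t % 8) * 4; }

// Worst conflict degree when the 32 threads of a warp are grouped into
// transactions of `threads_per_txn` consecutive threads. Banks are
// evaluated separately for each transaction.
int worst_degree(int (*first_word)(int), int threads_per_txn) {
    assert(threads_per_txn > 0 && 32 % threads_per_txn == 0);
    int worst = 0;
    for (int base = 0; base < 32; base += threads_per_txn) {
        std::vector<std::set<int>> words_in_bank(32);
        for (int t = base; t < base + threads_per_txn; ++t)
            for (int k = 0; k < 4; ++k) {      // 4 floats per thread
                int w = first_word(t) + k;     // 32-bit word index
                words_in_bank[w % 32].insert(w);
            }
        for (const auto& words : words_in_bank)
            worst = std::max(worst, static_cast<int>(words.size()));
    }
    return worst;
}
```

With 8 threads per transaction, `worst_degree(first_word_test1, 8)` is 1: each group of 8 threads covers all 32 banks exactly once. Evaluating the whole warp at once, `worst_degree(first_word_test1, 32)`, gives 4, which is the conflict the question expected to see between T0, T8, T16, and T24.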
The second case is identical to the first in this respect. Considering just threads 0-7, or just threads 8-15, and the transaction associated with each, there is no bank conflict.
In the 3rd case, the request across the warp will be broken up the same way: threads 0-7 will constitute one transaction. When we look at the activity for those threads, we see that, for example, threads 0-3 are writing to the same column(s). So we expect 4-way bank conflicts there.
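The transposed case can be modeled the same way on the host. This sketch assumes a mapping that is not confirmed by the thread: thread t writes to row t%4, starting at column (t/4)*4, with the row stride in floats passed as a parameter. The function name `transposed_store_degree` is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Worst per-transaction conflict degree for an ASSUMED transposed layout:
// thread t stores 4 consecutive floats starting at
//   (t % 4) * row_stride + (t / 4) * 4
// and threads are grouped into 4 transactions of 8 consecutive threads.
int transposed_store_degree(int row_stride) {
    assert(row_stride >= 32);  // each row must hold 8 column groups of 4 floats
    int worst = 0;
    for (int base = 0; base < 32; base += 8) {
        std::vector<std::set<int>> words_in_bank(32);
        for (int t = base; t < base + 8; ++t)
            for (int k = 0; k < 4; ++k) {
                int w = (t % 4) * row_stride + (t / 4) * 4 + k;
                words_in_bank[w % 32].insert(w);
            }
        for (const auto& words : words_in_bank)
            worst = std::max(worst, static_cast<int>(words.size()));
    }
    return worst;
}
```

With a row stride of 32 floats, threads 0-3 all land in banks 0-3 at four different rows, so `transposed_store_degree(32)` is 4, matching the 4-way conflict described above. The stride is left as a parameter because the result in this model depends on it.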
Thanks Robert, that's very kind of you.
I want to understand shared memory bank conflicts in detail and find a way to avoid them.
My use case: I want to transpose data held in registers into shared memory, like this:
My warp is arranged as 8 x 4, and each thread holds 4 float4 values, that is, 4 x 4 floats, or 16 registers per thread.
I want to transpose those registers into shared memory, but I found that I cannot avoid shared memory bank conflicts if I use stsmem128.
Is there another way to implement this without bank conflicts?
Could you show me how?