How to understand shared memory bank conflicts

Hello, NVIDIA experts,
I have a question about shared memory bank conflicts.
I know that shared memory is divided into 32 banks, with 4 bytes per bank.
The GPU raises a bank conflict if different threads (in the same warp) access different addresses in the same bank.
I ran some tests to check this.
Test 1, as follows:


My shared memory is 4 x 32 floats, my threads are arranged as 4 x 8, and each thread stores 4 floats (16 bytes). There are no bank conflicts.
I think this contradicts the theory above: obviously T0, T8, T16, and T24 are in the same banks (the other threads are similar), so why doesn't the GPU raise a bank conflict?
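For illustration, a store with this shape might look like the following sketch (a reconstruction, since the original figure is not reproduced here; the real kernel may differ):

    // Reconstruction for illustration only: 4 x 32 floats of shared memory,
    // a 4 x 8 thread arrangement, and one float4 (16 bytes) stored per thread.
    __global__ void test1_sketch(const float4 *in)
    {
        __shared__ float smem[4][32];              // 4 rows x 32 banks

        int row = threadIdx.x / 8;                 // 4 rows of threads
        int col = threadIdx.x % 8;                 // 8 threads per row

        // Thread t writes row (t/8), columns 4*(t%8) .. 4*(t%8)+3, so
        // T0, T8, T16, T24 all touch banks 0-3 (but see the reply below).
        reinterpret_cast<float4 *>(smem[row])[col] = in[threadIdx.x];
        __syncthreads();
    }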

Test 2, as follows:


My shared memory is 4 x 36 floats, my threads are still arranged as 4 x 8, and each thread stores 4 floats (16 bytes). There are still no bank conflicts.

Test 3, as follows:


My shared memory is still 4 x 36 floats and my threads are still arranged as 4 x 8, but the threads are transposed; each thread still stores 4 floats (16 bytes). This time I see bank conflicts in Nsight Compute (NCU). Why do bank conflicts occur here?

I'm confused by these three tests. How should they be explained?
When does the GPU raise a bank conflict, and when does it not?


When you store (or load) more than 4 bytes per thread, which is to say more than 128 bytes per warp, the GPU does not issue a single transaction. The largest transaction size is 128 bytes. If you request 16 bytes per thread, that is 512 bytes warp-wide per request. The GPU will break that up into 4 transactions (in this case: T0-T7 make up a transaction, T8-T15 make up a transaction, and so on), each of which is 128 bytes wide. The determination of bank conflicts is made per transaction, not per request or per warp or per instruction.
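A small host-side illustration of that grouping (an added sketch, assuming the lane-contiguous grouping just described):

    // For a warp-wide request of `bytes_per_thread` bytes per lane, show
    // which lanes fall into the same <=128-byte transaction.
    #include <cstdio>

    int main()
    {
        const int bytes_per_thread      = 16;                      // e.g. one 16-byte store per lane
        const int lanes_per_transaction = 128 / bytes_per_thread;  // 8 lanes here

        for (int lane = 0; lane < 32; ++lane)
            printf("lane %2d -> transaction %d\n", lane, lane / lanes_per_transaction);
        return 0;
    }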

The second case is identical to the first in this respect. Considering just the threads 0 to 7, or just the threads 8-15, and the transaction associated with each, there is no bank conflict.

In the 3rd case, the request across the warp will be broken up the same way: threads 0-7 will constitute one transaction. And when we look at the activity for those threads, we see that for example threads 0-3 are writing to the same column(s). So we expect 4-way bank conflicts there.


Thanks, Robert, that is very kind of you.
I want to figure out the details of shared memory bank conflicts and find a method to avoid them.
My application is this: I want to transpose existing registers into shared memory, like this:

My warp is arranged as 8 x 4, and each thread holds 4 float4s, i.e. 4 x 4 floats, or 16 registers per thread.
I want to transpose those registers into shared memory, but I found I cannot avoid shared memory bank conflicts if I want to use 128-bit shared stores (STS.128).
Is there any other method to implement this without bank conflicts?
Could you advise me?

You could take a look at cub::WarpStore (see the WarpStore<T, ITEMS_PER_THREAD, ALGORITHM, LOGICAL_WARP_THREADS, PTX_ARCH> class template reference in the CUB documentation).
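For reference, the usage pattern looks roughly like this (a sketch adapted from the CUB documentation, not tested against your exact case; the float / 16-items-per-thread parameters just mirror your layout):

    #include <cub/warp/warp_store.cuh>

    // Sketch: each thread of one full warp owns 16 floats in a blocked
    // arrangement; WARP_STORE_TRANSPOSE stages them through shared memory
    // internally and writes them out in a striped, coalesced-friendly order.
    __global__ void warp_store_sketch(float *d_out)
    {
        constexpr int warp_threads     = 32;
        constexpr int items_per_thread = 16;   // 4 x float4 per thread, as in your case
        constexpr int tile_size        = warp_threads * items_per_thread;

        using WarpStoreT = cub::WarpStore<float, items_per_thread,
                                          cub::WARP_STORE_TRANSPOSE, warp_threads>;

        __shared__ typename WarpStoreT::TempStorage temp_storage;  // one warp per block here

        float thread_data[items_per_thread];
        for (int i = 0; i < items_per_thread; ++i)
            thread_data[i] = threadIdx.x * items_per_thread + i;   // placeholder values

        // Cooperative, warp-wide store of the blocked items
        WarpStoreT(temp_storage).Store(d_out + blockIdx.x * tile_size, thread_data);
    }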

Hello, striker:
What a cool name!
Thanks for your suggestion.
However, I found it can't help with my application.
CUB's warp transpose is different from what I expected.
Let me check its code, as follows:

    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
    {
        thread_itr[(ITEM * BLOCK_THREADS)] = items[ITEM];
    }

Maybe I am wrong.
So, is there any other advice?

Some additional resources that may be of interest: 1 2 3

Thank you, that matches my expectation.

I have verified that if every thread stores 8 floats (32 bytes), bank conflicts occur. By your explanation, there should be 8 transactions, with every 4 threads making up a transaction, but that does not seem to be the case. I tested on an L20 card. Please shed some light, thanks!

Bank conflicts will depend on the access pattern across threads. 32 bytes cannot be stored by a single thread in a single transaction (or in a single SASS instruction). So at the SASS level there will definitely be multiple instructions involved, and depending on those instructions, some of them may be decomposed into multiple transactions.

More information would be needed about your case.


Thanks. The following source code causes bank conflicts on an L20; you can verify it with NCU:

#define N 32
#define factor 8

__global__ void bankConflictExample(float *input, float *output, int M) {
  // Declare shared memory: N * factor = 256 floats, spread across the 32 banks
  __shared__ float sharedMem[N * factor];

  // Each thread loads one value into shared memory
  int tid = threadIdx.x;
  int base = blockIdx.x * N * factor;

  // Each thread writes `factor` consecutive floats starting at factor*tid,
  // i.e. there is a stride of `factor` floats between neighboring threads.
  // This strided pattern is what causes the bank conflicts.
  for (int i = 0; i < factor; i++) {
    sharedMem[factor*tid + i] = input[base+factor*tid + i];
  }

  // Ensure all threads have finished writing to shared memory
  __syncthreads();

  for (int i = 0; i < factor; i++) {
    output[base+factor*tid+i] = sharedMem[factor*tid+i];
  }
}

int main() {
  const int M = N * factor * 1;   // total number of floats (one block's worth)
  float h_input[M], h_output[M];

  // std::cout << "sizeof(float) " << sizeof(float) << std::endl;

  // Initialize input array
  for (int i = 0; i < M; i++) {
    h_input[i] = i;
  }

  int threadblock = N;

  float *d_input, *d_output;
  cudaMalloc(&d_input, M * sizeof(float));
  cudaMalloc(&d_output, M * sizeof(float));

  cudaMemcpy(d_input, h_input, M * sizeof(float), cudaMemcpyHostToDevice);

  // Launch kernel with 1 block of 32 threads (M / (N * factor) == 1)
  bankConflictExample<<<M / (N * factor), threadblock>>>(d_input, d_output, M);
  cudaDeviceSynchronize();

  cudaMemcpy(h_output, d_output, M * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_input);
  cudaFree(d_output);

  return 0;
}

Yes, your access pattern with a stride (greater than 1) will cause bank conflicts.

I don’t think I said anywhere that accessing multiple bytes per thread automatically eliminates the possibility of bank conflicts.

I don’t really understand what question is being asked, if any, and probably won’t be able to respond further.

Bank conflicts are expected based on access pattern. To avoid bank conflicts, the access pattern matters.

Sorry to bother you. What I mean is: why does accessing 4 floats per thread not cause bank conflicts, while accessing 8 floats per thread does? So you mean that because of the SASS instruction size limit, each 8-float access is decomposed into several instructions/transactions, and in that situation one transaction accesses the same bank more than once, so bank conflicts occur? Thanks for the information, I will look at the SASS code to verify.

Update:
After looking at the generated SASS code, I found that the SASS alone can't tell me about the warp-level operation. The difference between factor = 4 and factor = 8 in the SASS is here:

      STS.128 [R0.X16], R8          // factor = 4
      BAR.SYNC.DEFER_BLOCKING 0x0

      STS.128 [R0], R8              // factor = 8
      STS.128 [R0+0x10], R12
      BAR.SYNC.DEFER_BLOCKING 0x0

So I guess there is a lower bound on how many threads are grouped into a transaction, for example a lower bound of 8.

Determining bank conflicts depends on the exact pattern of addresses across the warp. And in addition to 4 floats vs. 8 floats, your two cases have different addressing patterns.
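To make that concrete for the factor == 8 case, here is the kind of per-transaction arithmetic involved (my own sketch, following the transaction rules above; the first STS.128 writes floats 8*tid .. 8*tid+3, i.e. 16 bytes per lane, so lanes 0-7 share one 128-byte transaction):

    #include <cstdio>

    int main()
    {
        for (int lane = 0; lane < 8; ++lane) {      // one transaction: lanes 0..7
            int first_float = 8 * lane;             // first float index stored by STS.128 [R0]
            printf("lane %d -> floats %2d..%2d -> banks %2d..%2d\n",
                   lane, first_float, first_float + 3,
                   first_float % 32, (first_float + 3) % 32);
        }
        // Lanes 0 and 4 both cover banks 0..3 (floats 0..3 vs. 32..35),
        // lanes 1 and 5 both cover banks 8..11, and so on: 2-way conflicts.
        return 0;
    }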

Understanding how shared memory bank conflicts arise requires a certain amount of explanation. You can find one in unit 4 of this online tutorial series and in many other places on the web.

In a nutshell, you must consider the address pattern across the warp on an instruction-by-instruction basis, and if each thread is loading (in a single instruction issue) more than 4 bytes, you must consider the address pattern on a transaction-by-transaction basis. Each transaction will be no more than 128 bytes warp-wide; that is, when the entire warp is requesting more than 128 bytes at the point of a single SASS instruction issue, the memory controller will break that into multiple transactions.

The address pattern generated by each thread in the warp must be considered against the bank pattern. Shared memory is broken into banks, which can be thought of as columns if shared is arranged in a 2D format, where each column is 4 bytes wide and there are 32 columns. You can see pictorial examples of thinking/looking at shared that way, at the top of this thread.

When the addressing pattern is such that there is no more than one item needed per column, then you will have a non-bank-conflicted access. When the addressing pattern is such that there are two or more items needed in a single column, there will be bank conflicts.

When you write a code like this:

sharedMem[factor*tid + i] = input[base+factor*tid + i];

and factor == 1, then the address pattern across the warp is that each thread will have an address that is adjacent to its neighbor threads. This will result in one item per column needed in shared, and this will be non-bank-conflicted.

When factor is 2 or higher, you run into the possibility that multiple items will be needed in a single column. This will result in bank conflicts.

For example if factor is 2, on the first iteration of the loop (i is zero), the address index pattern is:

warp lane:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
index:      0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62

In this case we see that the thread at warp lane zero is requesting the item at index 0, while the thread at warp lane 16 is requesting the item at index 32, which is in the same shared memory column (bank) as index 0. This produces a 2-way bank conflict.
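A quick way to check this numerically (an added helper, not part of the original reply): for 4-byte floats, the bank is simply the float index modulo 32.

    #include <cstdio>

    int main()
    {
        const int factor = 2;
        for (int lane = 0; lane < 32; ++lane) {
            int index = factor * lane;          // first loop iteration, i == 0
            printf("lane %2d -> index %3d -> bank %2d\n", lane, index, index % 32);
        }
        // Lane 0 and lane 16 both map to bank 0, hence the 2-way conflict.
        return 0;
    }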

This sort of analysis is necessary to statically determine bank conflicts in the general case. It's tedious, so I won't do it repeatedly.

When I am teaching CUDA, I often mention that if an index is created using threadIdx.x only as an additive term, that will produce adjacent indexing across a warp, which is canonically good both for coalescing and for bank-conflict considerations. It's not the only possible bank-conflict-free pattern, but it is one of them.

The first example I gave (factor == 1) fits this description. The second example I gave (factor == 2) does not fit this rubric.
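As a tiny illustration of that rule of thumb (an added sketch; `data`, `out`, and a 32-thread launch are assumed):

    __global__ void store_patterns(const float *data, float *out)
    {
        __shared__ float smem[64];
        int t = threadIdx.x;             // one 32-thread warp assumed

        smem[t] = data[t];               // threadIdx.x purely additive: one bank per lane, no conflict
        __syncthreads();

        smem[2 * t] = data[t];           // stride 2: lanes 0 and 16 both hit bank 0 -> 2-way conflict
        __syncthreads();

        out[t] = smem[t] + smem[2 * t];  // read back so the stores are not optimized away
    }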
