No clear, concise data on GPU shared memory bank layout

I am trying to understand the bank layout of shared memory for the TU106 architecture, or for any GPU for that matter. Information on how the banks are organized seems surprisingly hard to find. Are they 32 bits wide or 32 bytes wide? Are the banks vertical or horizontal? I thought they were 32 bits wide, which makes sense: each thread in a warp could then write a 4-byte value to its own bank without a conflict. That is exactly what I do, yet I still get multiple bank conflicts, which made me think the banks might be vertical. But if they are vertical and only 32 bits wide, then any write to shared memory would result in a bank conflict. I have read one post that showed the banks as vertical, others show them as horizontal; some say 32 bits, some say 32 bytes.

  1. The first thing is: if I am only storing 4-byte floats in shared memory, the banks are 32 bits wide, and the banks are horizontal, how can I get a bank conflict?

If anyone can point me towards, or tell me, what the layout of the banks actually is, that would be great. I understand that "horizontal" and "vertical" are relative here and have no intrinsic meaning.

  2. Let's say the banks are arranged vertically: bank 0 at the top, 32 bits wide, with bank 1 directly below it, and so on down to bank 31. Addressing must then wrap around to the top and continue in bank 0 again. If that is the case, I have two further questions.

2a. Why do people say a bank is 32 bits wide, then? It isn't. Say shared memory holds 1024 four-byte words (4 KB): that gives 32 columns and 32 rows, so bank 0 is 1024 bits (128 bytes), not 32 bits.

2b. If the banks are structured like this, how are contiguous writes performed? Say I write 5 floats of 4 bytes each: the first one goes in row 0, column 0. Where does the second one go, row 1, column 0 or row 0, column 1?

Shared memory is explained here, for example.

The Turing shared memory is organized as 32 banks, each bank reading or writing 32 bits per clock cycle.

For __shared__ int var[1024];

  • var[0], var[32], var[64] are in bank 0,
  • var[1], var[33], var[65] are in bank 1,
  • var[2], var[34], var[66] are in bank 2.
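
To make that mapping concrete, here is a minimal sketch; bank_of and warp_write_demo are illustrative names of my own, not CUDA APIs. The bank of a 4-byte word is just its word index modulo 32, so a one-word-per-thread access uses a different bank for every thread, while a stride-two access makes pairs of threads share a bank:

    // Bank of a 4-byte word in shared memory, assuming 32 banks that are each
    // 4 bytes wide (the mapping described above). bank_of is not a CUDA API,
    // just the arithmetic written out.
    __host__ __device__ int bank_of(int word_index)
    {
        return word_index % 32;                   // e.g. bank_of(0) == bank_of(32) == 0
    }

    // Illustrative kernel, assumed to be launched as warp_write_demo<<<1, 32>>>(d_out)
    // so that the block is a single warp.
    __global__ void warp_write_demo(int *out)
    {
        __shared__ int var[1024];
        int t = threadIdx.x;                      // t = 0..31

        for (int i = t; i < 1024; i += blockDim.x)
            var[i] = i;                           // initialize every word
        __syncthreads();

        int unit_stride = var[t];                 // word t     -> bank t: no conflict
        int stride_two  = var[2 * t];             // word 2 * t -> bank (2 * t) % 32:
                                                  // threads t and t + 16 share a bank,
                                                  // so this load is a 2-way conflict
        out[t] = unit_stride + stride_two;
    }

The first load is serviced in a single pass; the second is serialized into two passes because each active bank has to return two different words.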

2D shared arrays are stored as flattened, row-major 1D arrays. For those, the bank of an element depends on the dimensions, in particular whether the row length is divisible by 32:

For __shared__ int var[16][64];

  • var[0][0], var[0][32], var[1][0] are in bank 0,
  • var[0][1], var[0][33], var[1][1] are in bank 1,
  • var[0][2], var[0][34], var[1][2] are in bank 2.
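
A sketch of the consequence (my own example, using the var[16][64] array above): because the row length is a multiple of 32, a whole column sits in a single bank, so row-wise reads are conflict free while column-wise reads are serialized.

    // Illustrative kernel, assumed to be launched as folded_2d_demo<<<1, 32>>>(d_out).
    // For int var[16][64], element var[i][j] is word i * 64 + j, and because 64
    // is a multiple of 32 its bank is simply j % 32: a whole column shares one bank.
    __global__ void folded_2d_demo(int *out)
    {
        __shared__ int var[16][64];
        int t = threadIdx.x;                      // t = 0..31

        for (int i = t; i < 16 * 64; i += blockDim.x)
            (&var[0][0])[i] = i;                  // initialize every word
        __syncthreads();

        int row_wise = var[0][t];                 // banks 0..31: conflict free

        // Column-wise: thread t reads var[t % 16][t / 16], i.e. columns 0 and 1.
        // All of column 0 lands in bank 0 and all of column 1 in bank 1, so each
        // bank has to serve 16 different words: a 16-way conflict.
        int col_wise = var[t % 16][t / 16];

        out[t] = row_wise + col_wise;
    }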

For __shared__ int var[32][33];

  • var[0][0], var[0][32], var[1][31] are in bank 0,
  • var[0][1], var[1][0], var[1][32] are in bank 1,
  • var[0][2], var[1][1], var[2][0] are in bank 2.

Compared to the previous layout, this padded one makes it possible to access the array both row-wise and column-wise without bank conflicts.
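
A sketch of the effect of the padding column (again my own example kernel): with int var[32][33], the word index of var[t][0] is t * 33, and (t * 33) % 32 == t, so a warp reading a column now touches 32 different banks.

    // Illustrative kernel, assumed to be launched as padded_demo<<<1, 32>>>(d_out).
    // The extra padding column in int var[32][33] shifts each row by one bank.
    __global__ void padded_demo(int *out)
    {
        __shared__ int var[32][33];               // 33 = 32 columns + 1 padding column
        int t = threadIdx.x;                      // t = 0..31

        for (int i = t; i < 32 * 33; i += blockDim.x)
            (&var[0][0])[i] = i;                  // initialize every word
        __syncthreads();

        int row_wise = var[0][t];                 // word t      -> bank t: no conflict
        int col_wise = var[t][0];                 // word t * 33 -> bank (t * 33) % 32 == t:
                                                  // also no conflict. With var[32][32] the
                                                  // same column read would hit only bank 0.
        out[t] = row_wise + col_wise;
    }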

Figure 3 in this blog post illustrates the layout, along with methods to avoid conflicts.