No clear, concise data on GPU shared memory bank layout

I am trying to understand the bank layout of shared memory for the TU106 architecture, or for any GPU for that matter. Information on how the banks are organized seems surprisingly hard to find. Are they 32 bits wide or 32 bytes wide? Are the banks vertical or horizontal? I thought they were 32 bits wide, which makes sense: each thread in a warp could then write a 4-byte value to its own bank without a conflict. That is exactly what I do, yet I still get multiple bank conflicts, which made me think the banks might be vertical. But if they are vertical and only 32 bits wide, then any write to shared memory would result in a bank conflict. I have read one post that showed the banks as vertical, others show them as horizontal; some say 32 bits, some say 32 bytes.

  1. The first thing is: if I am only storing 4-byte floats in shared memory, the banks are 32 bits wide, and the banks are horizontal, how can I get a bank conflict?

If anyone can point me towards, or tell me, what the layout of the banks actually is, that would be great. I understand that "horizontal" and "vertical" are relative here and have no intrinsic meaning.

  2. Let's say the banks are arranged vertically: bank 0 at the top, 32 bits wide, with bank 1 directly below it, and so on down to bank 31. Addressing must then wrap around to the top and continue in bank 0 again. If that is the case, I have two further questions.

2a. Why do people say a bank is 32 bits wide, then? It isn't. Say shared memory holds 1024 four-byte words (4 KB): that gives 32 columns and 32 rows, so bank 0 is 1024 bits (128 bytes), not 32 bits.

2b. If the banks are structured like this, how are contiguous writes performed? Say I write 5 floats of 4 bytes each: the first one goes in row 0, column 0. Where does the second one go, row 1, column 0 or row 0, column 1?

Shared memory is explained here, for example.

The Turing shared memory is organized as 32 banks, each bank reading or writing 32 bits per clock cycle.

For __shared__ int var[1024];

  • var[0], var[32], var[64] are in bank 0,
  • var[1], var[33], var[65] are in bank 1,
  • var[2], var[34], var[66] are in bank 2.
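
To make that mapping concrete, here is a minimal sketch; bank_of and warp_write_demo are illustrative names of my own, not CUDA APIs. The bank of a 4-byte word is just its word index modulo 32, so a one-word-per-thread access uses a different bank for every thread, while a stride-two access makes pairs of threads share a bank:

    // Bank of a 4-byte word in shared memory, assuming 32 banks that are each
    // 4 bytes wide (the mapping described above). bank_of is not a CUDA API,
    // just the arithmetic written out.
    __host__ __device__ int bank_of(int word_index)
    {
        return word_index % 32;                   // e.g. bank_of(0) == bank_of(32) == 0
    }

    // Illustrative kernel, assumed to be launched as warp_write_demo<<<1, 32>>>(d_out)
    // so that the block is a single warp.
    __global__ void warp_write_demo(int *out)
    {
        __shared__ int var[1024];
        int t = threadIdx.x;                      // t = 0..31

        for (int i = t; i < 1024; i += blockDim.x)
            var[i] = i;                           // initialize every word
        __syncthreads();

        int unit_stride = var[t];                 // word t     -> bank t: no conflict
        int stride_two  = var[2 * t];             // word 2 * t -> bank (2 * t) % 32:
                                                  // threads t and t + 16 share a bank,
                                                  // so this load is a 2-way conflict
        out[t] = unit_stride + stride_two;
    }

The first load is serviced in a single pass; the second is serialized into two passes because each active bank has to return two different words.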

2D shared arrays are stored as flattened, row-major 1D arrays. For those, the bank of an element depends on the dimensions, in particular whether the row length is divisible by 32:

For __shared__ int var[16][64];

  • var[0][0], var[0][32], var[1][0] are in bank 0,
  • var[0][1], var[0][33], var[1][1] are in bank 1,
  • var[0][2], var[0][34], var[1][2] are in bank 2.
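
A sketch of the consequence (my own example, using the var[16][64] array above): because the row length is a multiple of 32, a whole column sits in a single bank, so row-wise reads are conflict free while column-wise reads are serialized.

    // Illustrative kernel, assumed to be launched as folded_2d_demo<<<1, 32>>>(d_out).
    // For int var[16][64], element var[i][j] is word i * 64 + j, and because 64
    // is a multiple of 32 its bank is simply j % 32: a whole column shares one bank.
    __global__ void folded_2d_demo(int *out)
    {
        __shared__ int var[16][64];
        int t = threadIdx.x;                      // t = 0..31

        for (int i = t; i < 16 * 64; i += blockDim.x)
            (&var[0][0])[i] = i;                  // initialize every word
        __syncthreads();

        int row_wise = var[0][t];                 // banks 0..31: conflict free

        // Column-wise: thread t reads var[t % 16][t / 16], i.e. columns 0 and 1.
        // All of column 0 lands in bank 0 and all of column 1 in bank 1, so each
        // bank has to serve 16 different words: a 16-way conflict.
        int col_wise = var[t % 16][t / 16];

        out[t] = row_wise + col_wise;
    }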

For __shared__ int var[32][33];

  • var[0][0], var[0][32], var[1][31] are in bank 0,
  • var[0][1], var[1][0], var[1][32] are in bank 1,
  • var[0][2], var[1][1], var[2][0] are in bank 2.

Compared to the previous layout, this padded one makes it possible to access the array both row-wise and column-wise without bank conflicts.
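
A sketch of the effect of the padding column (again my own example kernel): with int var[32][33], the word index of var[t][0] is t * 33, and (t * 33) % 32 == t, so a warp reading a column now touches 32 different banks.

    // Illustrative kernel, assumed to be launched as padded_demo<<<1, 32>>>(d_out).
    // The extra padding column in int var[32][33] shifts each row by one bank.
    __global__ void padded_demo(int *out)
    {
        __shared__ int var[32][33];               // 33 = 32 columns + 1 padding column
        int t = threadIdx.x;                      // t = 0..31

        for (int i = t; i < 32 * 33; i += blockDim.x)
            (&var[0][0])[i] = i;                  // initialize every word
        __syncthreads();

        int row_wise = var[0][t];                 // word t      -> bank t: no conflict
        int col_wise = var[t][0];                 // word t * 33 -> bank (t * 33) % 32 == t:
                                                  // also no conflict. With var[32][32] the
                                                  // same column read would hit only bank 0.
        out[t] = row_wise + col_wise;
    }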

Figure 3 in this blog post illustrates the layout, along with methods to avoid conflicts.