Requesting clarification on shared memory bank conflicts and shared memory access


Assumption - 32-bit word size (data bus size); each bank is 32 bits wide. The data to be operated upon is either 16, 32, or 64 bits.

To make the discussion easier, let us visualize the hardware piece of shared memory as a 2-dimensional array with M rows and 32 columns (32 banks), such that M >= 32 and each bank is 32 bits wide. So each row of shared memory is 32 banks * 32 bits = 32 * 4 bytes = 128 bytes.
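To make this mental model concrete, here is a small sketch of the address-to-bank mapping I am assuming (the helper function is my own illustration, not an API):

```
// Sketch of the assumed mapping: successive 32-bit words go to
// successive banks, 32 banks in total. (bank_of is a hypothetical
// helper for illustration only.)
__host__ __device__ unsigned bank_of(unsigned byte_offset)
{
    unsigned word = byte_offset / 4;  // which 32-bit word
    return word % 32;                 // which bank (column of the 2D picture)
}
// e.g. bytes 0-3 -> bank 0, bytes 4-7 -> bank 1, ..., bytes 124-127 -> bank 31,
// and byte 128 wraps around to bank 0 again (the next "row").
```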

From what I have read in multiple resources, bank conflicts occur when multiple threads in a given warp access different address locations within the same bank. Suppose all N (N = 32) threads of warp_0 read or write a different address in the same bank; then we have an N-way bank conflict, and it was further mentioned that this results in N serialized requests. A few scenarios regarding bank conflicts come to my mind for which I need some clarification. They follow below.

Scenario 1 - Why do N threads accessing different addresses in the same bank lead to serialized accesses?

A few possible reasons come to my mind. Following is my thought process.

Assumption - the data being operated on is 32 bits, so it fits perfectly in a memory bank of width 32 bits.

Reason (a) - My thought process: in each clock cycle, a warp consisting of 32 threads can access in parallel at most 32 different addresses in shared memory, and assuming for the sake of this discussion that each of these 32 addresses belongs to a different bank, the threads of a warp can read only a single row (128 bytes = 32 banks * 32 bits per bank) of shared memory per clock cycle. Therefore, if the threads of a warp access different addresses in the same bank, that is equivalent to only a single thread of the warp being serviced per clock cycle. Hence, the 32 threads of the warp take 32 clock cycles, which is equivalent to 32 serialized accesses.
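To illustrate reason (a), here is a minimal sketch contrasting the two patterns I am describing (a hypothetical kernel, assuming 32-bit elements):

```
__global__ void access_patterns()
{
    __shared__ int smem[32 * 32];   // 32 "rows" of 32 banks (4 KB)

    // Conflict-free: thread i reads word i, so the 32 threads of a warp
    // hit 32 different banks - one row serviced per access.
    int a = smem[threadIdx.x];

    // 32-way conflict: thread i reads word i*32, so all 32 threads hit
    // bank 0 at 32 different locations - the accesses serialize.
    int b = smem[threadIdx.x * 32];

    if (a + b == -1) smem[0] = 0;   // dummy use so the reads are not optimized away
}
```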

Question - I am not sure my thought process above is correct. If my reasoning in reason (a) is wrong, can you please correct me and help me fill in the missing gaps?

Scenario 2 -
Assumption - the data being operated upon is 16 bits and the shared memory bank width is 32 bits.

So when the 16-bit data gets stored in shared memory, a memory bank of width 32 bits will store two data variables, each of size 16 bits. Hence, two threads will access two different 16-bit data variables at two different addresses within the same 32-bit memory bank. This should lead to a two-way bank conflict, since bank conflicts are defined as different threads accessing different address locations within the same bank.
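For concreteness, here is a sketch of the access pattern I have in mind (hypothetical kernel, launched with 64 or fewer threads; whether it really conflicts is what I am asking):

```
__global__ void sixteen_bit_access()
{
    __shared__ short smem16[64];   // 64 x 16 bits = one 128-byte row

    // Threads 0 and 1 read smem16[0] and smem16[1]: two different addresses,
    // but they fall in the same 32-bit word of bank 0.
    short v = smem16[threadIdx.x];

    if (v == -1) smem16[0] = 0;    // keep the read alive
}
```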

Question (2.1) - If my understanding is incorrect, can you please correct me and fill in the missing gaps in my understanding?

Question (2.2.a) - Also, if my understanding is correct, then this 2-way bank conflict for 16-bit data can be avoided by padding each element with 16 more bits (see the sketch after these questions). Padding should avoid bank conflicts, but does it have any repercussions?

Question (2.2.b) - Also, if my understanding is correct, suppose we do not pad the 16-bit data. In this scenario one row (128 bytes) of shared memory can hold 64 data elements, each of size 16 bits. Can the threads of two warps then read these 64 16-bit data variables in a single clock cycle? If so, padding the data to avoid bank conflicts would only reduce performance. Again, if my understanding is incorrect, can you please correct me and fill in the missing gaps?
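And here is a sketch of the padding idea from Question (2.2.a), assuming each 16-bit value gets its own 32-bit slot (again a hypothetical kernel):

```
__global__ void padded_sixteen_bit()
{
    __shared__ int smem_padded[32];   // one 16-bit payload per 32-bit slot

    // Thread i now reads slot i, so consecutive threads land in
    // consecutive banks.
    short v = (short)(smem_padded[threadIdx.x] & 0xFFFF);

    if (v == -1) smem_padded[0] = 0;  // keep the read alive

    // One clear repercussion: the shared-memory footprint doubles
    // (64 bytes of 16-bit payload now occupies 128 bytes).
}
```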

Scenario 3 -
Assumption - the data being operated upon is 64 bits, each shared memory bank is 32 bits wide, and the physical piece of shared memory is a 2-dimensional array of size M rows * 32 columns.

So when 32 data variables, each of size 64 bits, get stored in shared memory, each pair of consecutive 32-bit memory banks stores a single 64-bit data variable. Next, the 32 threads of a warp access these 32 64-bit data variables, which require two rows of shared memory to store. Hence, for one thread to access one 64-bit data item from shared memory, it has to access two consecutive 32-bit memory banks. This will cause two-way bank conflicts, because two threads will be accessing two different addresses in the same bank when operating on 64-bit data.

Now, a few have suggested this 2-way bank conflict can be avoided if we pad the 64-bit data with 32 more bits. The 64 bits plus the additional 32 bits of padding make a 96-bit (12-byte) datum, which occupies three consecutive 32-bit banks in shared memory. This strategy helps avoid bank conflicts, but by introducing padding one has to read 3 rows of shared memory (32 elements * 96 bits = 384 bytes = 3 rows of 128 bytes).
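A sketch of the two 64-bit layouts under discussion (hypothetical kernel; note the 96-bit variant has to store the 64-bit payload as two 32-bit halves, since a struct with a raw double member would be padded back to 16 bytes by alignment rules):

```
struct Padded64 { int lo, hi, pad; };  // 12 bytes = 3 banks per element

__global__ void layouts64()
{
    __shared__ double   plain[32];    // 8 bytes = 2 consecutive banks each;
                                      // 16 elements per 128-byte row
    __shared__ Padded64 padded[32];   // 12 bytes = 3 consecutive banks each

    double a = plain[threadIdx.x];
    int    b = padded[threadIdx.x].lo;
    if (a == -1.0 && b == -1) plain[0] = 0.0;  // keep the reads alive
}
```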

Question - In reason (a) I argued that in one clock cycle the 32 threads of a warp read one row of shared memory (128 bytes = 32 banks * 32 bits per bank). So if we pad the 64-bit data to 96 bits, reading 3 rows takes 3 clock cycles but avoids bank conflicts. If instead we do not pad, we only have to read 2 rows (32 elements * 64 bits = 256 bytes), which should take 2 clock cycles. I am not sure my understanding is correct; if I am wrong, can you please fill in the gaps and correct me?

Question - Also, if I am right, then padding the 64-bit data to 96 bits to avoid bank conflicts costs one extra clock cycle. Hence, in this scenario we should be okay with the two-way conflicts, because although padding avoids bank conflicts, not padding does the work in 2 clock cycles instead of 3. Again, I am not sure my understanding is correct. I would appreciate it if you could correct me if I am wrong.

Robert will probably be along with a more complete reply, but here are a couple of things.

This statement needs amending:

“bank conflicts occur when multiple threads in a given warp access different address locations within the same bank.”

Quoting from the "Best Practices Guide":

“On devices of compute capability 5.x or newer, each bank has a bandwidth of 32 bits every clock cycle, and successive 32-bit words are assigned to successive banks.”

So two 16-bit elements packed into one 32-bit bank word, with one element accessed by each of two successive threads, causes no conflict.


I wouldn’t try to go very far with this. It is a stated performance limit based on the hardware design. It would be like asking “why can’t I issue two instructions to the same functional unit in the same clock cycle?” Because the hardware is not designed in a way that supports that.

Shared memory hardware supports reading or writing one 32-bit quantity per bank, per cycle. If you want to go a bit farther, I believe the 2D arrangement suggested is a helpful mental model - there may be something about the hardware design that creates a 2D effect, and all the storage locations for a column (a bank) may have some shared hardware, like an output buffer or an addressing input, which can only take on one value per cycle. But that isn’t written anywhere that I know of. The basic idea is that shared memory can read or write one 32-bit quantity per bank, per cycle; that is due to the hardware design, and there isn’t any more information about it than that. It’s not clear that any additional info is needed to properly educate the CUDA programmer in that respect.

We must keep in mind the broadcast rule for shared memory, which applies at least from cc 3.0 forward (I think it applied to Fermi also). Two requests to the same bank and the same 32-bit location in that bank do not create a bank conflict. They invoke the broadcast rule, which says that the request for that bank can be handled in a single cycle, regardless of how many threads in the warp in that cycle are requesting data from that location. We also don’t require all threads to access the data at a particular granularity or offset within the location. One thread can request the first 16 bits, and another thread can request the next 16 bits in that 32-bit location in that bank, and the broadcast rule applies.

We also need more clarity about the access pattern than you have provided. To discuss that case, we need to know whether, for a given bank, the two 16-bit quantities are adjacent to each other, in which case they invoke the broadcast rule, or not adjacent, in which case they 2-way serialize, creating a 2-way bank conflict, considering only those two thread requests in that warp, in that cycle. Let’s consider an example. Thread 0 in a warp requests a 16-bit quantity starting at address (byte offset) 0. Thread 1 in the warp requests a 16-bit quantity starting at address (byte offset) 2. These both belong to the first bank, and both belong to the first location in that bank. The broadcast rule says that Thread 0 and Thread 1 can retrieve their quantities in the same shared access cycle. No serialization, no bank conflicts, considering just those 2 threads. Another example: Thread 0 requests a 16-bit quantity at byte offset 0, Thread 1 requests a 16-bit quantity at byte offset 130. Both of those pertain to bank 0, but they don’t pertain to the same location in bank 0. Therefore, that will create 2-way serialization, i.e. a 2-way bank conflict, considering just the needs of those 2 threads.
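Spelling out that arithmetic in a small standalone sketch (plain host code, using the successive-32-bit-words mapping quoted earlier):

```
#include <cstdio>

int main()
{
    unsigned offsets[] = {0, 2, 130};  // byte offsets from the examples above
    for (unsigned off : offsets) {
        unsigned word = off / 4;       // index of the 32-bit word
        printf("byte %3u -> bank %2u, word %u within that bank\n",
               off, word % 32, word / 32);
    }
    // byte   0 -> bank 0, word 0 }
    // byte   2 -> bank 0, word 0 }  same word: broadcast, no conflict
    // byte 130 -> bank 0, word 1    different word: 2-way conflict with byte 0
    return 0;
}
```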

There is an additional piece of info needed to properly interpret this case. I will assume that the addresses being requested warp-wide are adjacent and contiguous. For the reasons you state, you might initially conclude that such an access will necessarily generate a 2-way bank conflict when each thread needs 64 bits. However, we don’t refer to this case as a bank conflict. In general, when the memory controller in a GPU encounters a request that, considered warp-wide, requires more than 128 bytes to be retrieved, it will break such a request into 2 (or more) transactions. The memory controller never issues a single transaction to memory (global or shared) that is wider than 128 bytes. This is by HW design of the GPU. That’s how it works. I won’t be able to answer questions about “why is it that way?” Given that, however, the first 128 bytes of a 64-bit-per-thread load will be issued in the first transaction, and the second 128 bytes will be issued in a second transaction. Considering just these transactions, there are no bank conflicts for the case I outlined (adjacent/contiguous across the warp).
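Here is a sketch of that case (hypothetical kernel, adjacent/contiguous 64-bit loads across the warp):

```
__global__ void load64()
{
    __shared__ double sdata[32];   // 256 bytes = two 128-byte "rows"

    // Warp-wide, this request covers 256 contiguous bytes, so it is split
    // into two 128-byte transactions: threads 0-15 in the first and
    // threads 16-31 in the second. Within each transaction every bank is
    // touched exactly once, so there is no bank conflict.
    double v = sdata[threadIdx.x];

    if (v == -1.0) sdata[0] = 0.0; // keep the read alive
}
```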

You’ve brought up padding twice now. There shouldn’t be any need to pad data for these cases, and padding wouldn’t provide any benefits. The GPU designers have thought about a lot of different types of access for shared memory, and in many cases have provided mechanisms so that performance doesn’t fall over for typical/intended use cases.


@Robert_Crovella A big thanks for your detailed explanation. I have some follow-up questions; I would appreciate it if you could please help me with them.

Scenario_1 -

Assumption - The data element being operated upon is 16 bits; for a given bank, the two 16-bit quantities are adjacent to each other; and shared memory is a 2-dimensional array of M rows and 32 columns with M >= 32.

Question - 16 bits is equivalent to 2 bytes, and one row of shared memory is 128 bytes, so 64 elements, each of size 16 bits, can be stored contiguously in one row of shared memory. Since 32 bits can be accessed from a bank per cycle, two threads can concurrently access two 16-bit elements stored adjacent to each other in the same bank. So can all the threads belonging to two different warps (64 threads) access the 64 elements (each 16 bits) in a single row of shared memory in one clock cycle? If this is possible, it should really speed up the entire process.

Scenario_2 -

Assumption - The data element being operated upon is 64 bits and every element is stored adjacent to the next; the addresses requested warp-wide are adjacent and contiguous; and shared memory is a 2-dimensional array of M rows and 32 columns with M >= 32.

Question - The 64-bit elements are stored adjacent to each other, so every 64-bit element spans two adjacent banks. Thus 16 threads of a warp read a row (128 bytes, 16 elements) of shared memory in one clock cycle, which means that in that clock cycle only 16 threads of the warp are active and the other 16 threads are inactive. Can we improve on this somehow by making the other 16 inactive threads of the warp do something? Or is this the best we can do with 64-bit data, so that 16 threads of the warp remain inactive for a period of one clock cycle?

AFAIK, no. I don’t believe the requests from two separate warps can be combined into a single request for shared memory access. However, a similar or related question might be “what is the bandwidth of shared memory?” I mentioned that shared memory can serve one 32-bit quantity per bank per cycle, but I mostly had a single request in view there. I do not happen to know offhand whether shared memory can serve two separate requests (let’s say from two different warp schedulers/SMSPs) in the same cycle. If I wanted to answer that question, I would scour the various whitepapers and also high-quality microbenchmarking such as that done by the Citadel group, to see if I could find any clues. My suspicion, however, is that post-Kepler, the maximum bandwidth of shared memory is one 32-bit quantity per bank per cycle.

What did you have in mind by “do something”? Remember that all current CUDA GPUs issue the same instruction to all threads in a warp. If any thread is doing a shared read, they all are. This is true independent of convergence status. Even with warp divergence, even with the volta execution model, the statement is true, albeit some threads may be “disabled”/“inactive”. So what exactly do you mean by “do something”? If the first 16 threads are doing a shared read, can the other 16 threads do something that is not a shared read, in the same cycle? No, they cannot.
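To illustrate with a trivial sketch (hypothetical kernel; the names are mine):

```
__global__ void divergent()
{
    __shared__ int smem[32];
    int x = 0, y = 0;

    if (threadIdx.x < 16)
        x = smem[threadIdx.x];  // the shared-load instruction is issued
                                // warp-wide; threads 16-31 are simply
                                // inactive for it
    else
        y = threadIdx.x * 2;    // a separate instruction, issued in a
                                // different cycle, not overlapped with
                                // the shared load above

    if (x + y == -1) smem[0] = 0;  // keep both results alive
}
```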


@Robert_Crovella Thank you, that answers my questions, and I will dig through the whitepapers and the microbenchmarking by the Citadel group. I started with CUDA a couple of months ago, and I had all these questions in my mind but no one to discuss them with. You have been a great help. Thank you.