Assumption - 32-bit word size (data bus size). Each bank is 32 bits wide. The data to be operated upon is either 16-bit, 32-bit, or 64-bit.
To make the discussion easier, let us visualize the hardware shared memory as a two-dimensional array with M rows and 32 columns (32 banks), such that M >= 32 and each bank is 32 bits wide. Each row of shared memory is then 32 bits * 32 = 4 bytes * 32 = 128 bytes.
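To make this layout concrete, here is a small Python sketch of the mapping I am assuming: successive 32-bit words go to successive banks, and every 32 words start a new row. The function name `bank_and_row` and the constants are my own and only model the picture above; they are not part of any CUDA API.

```python
BANK_WIDTH_BYTES = 4   # each bank is 32 bits wide (assumption stated above)
NUM_BANKS = 32         # 32 banks, i.e. 32 columns in the 2D picture

def bank_and_row(byte_address):
    """Return (bank, row) for a byte address in the assumed layout."""
    word = byte_address // BANK_WIDTH_BYTES   # index of the 32-bit word
    return word % NUM_BANKS, word // NUM_BANKS

row_bytes = BANK_WIDTH_BYTES * NUM_BANKS      # bytes in one row

print(row_bytes)            # size of one row in bytes
print(bank_and_row(4))      # second word of the first row
print(bank_and_row(128))    # first word of the second row
```

With this mapping, byte address 128 wraps back to bank 0 but lands in the next row, which is the situation the scenarios below are about.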
From what I have read in multiple resources, bank conflicts occur when multiple threads in a given warp access different address locations within the same bank. Suppose that for warp_0 all N (N = 32) threads read or write to a different address in the same bank; then we have an N-way bank conflict, and it was further mentioned that this results in N serialized requests. In the context of bank conflicts, a few scenarios come to mind for which I need some clarification. The scenarios are as follows.
Scenario 1 - Why do N threads accessing different addresses in the same bank lead to serialized access?
A few possible reasons come to mind. The following is my thought process.
Assumption - the data being operated upon is 32 bits, so it fits exactly in a memory bank of width 32 bits.
Reason (a) - My thought process: in each clock cycle, a warp of 32 threads can access at most 32 different addresses in shared memory in parallel, and, for the sake of this discussion, each of these 32 addresses belongs to a different bank. Hence, in a single clock cycle, the threads of a warp can read only a single row of shared memory (where each row is 128 bytes = 32 banks * 32 bits per bank). Therefore, if the threads of a warp access different addresses in the same bank, that is equivalent to only a single thread of the warp being served per clock cycle. The 32 threads of the warp then take 32 clock cycles, which is equivalent to 32 serialized accesses.
Question - I am not sure whether the thought process above is correct. If my reasoning in reason (a) is wrong, can you please correct me and help me fill in the gaps?
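To state reason (a) precisely, here is a minimal Python model of how I would count the conflict degree for one warp-wide request. It is only a sketch: the helper `conflict_degree` is hypothetical, and it assumes that a bank can serve one distinct 32-bit word per cycle and that threads reading the same word are served together, which is my reading of the documentation and may be wrong.

```python
from collections import defaultdict

def conflict_degree(byte_addresses, num_banks=32, bank_width=4):
    """Largest number of distinct 32-bit words requested from any one bank."""
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // bank_width
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())

# Thread t reads the t-th 32-bit word: one row, 32 different banks.
row_access = [4 * t for t in range(32)]

# Thread t reads the first word of row t: every address falls in bank 0.
column_access = [128 * t for t in range(32)]

print(conflict_degree(row_access))
print(conflict_degree(column_access))
```

Under this model the first pattern needs one pass and the second needs 32, which is the 32-way serialization I describe in reason (a). Whether the hardware really behaves this way is what I am asking.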
Scenario 2 -
Assumption - data being operated upon is 16 bits and shared memory bank’s width is 32 bits
When 16-bit data is stored in shared memory, a 32-bit-wide memory bank holds two data variables of 16 bits each. Hence, two threads will access two different 16-bit variables at two different addresses within the same 32-bit memory bank. This should lead to a two-way bank conflict, since bank conflicts are defined as different threads accessing different address locations within the same bank.
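To show what I mean, here is a small Python sketch of where each thread's 16-bit element lands, assuming thread t accesses element t of a tightly packed array that starts at byte address 0. The helper `locate_16bit` is hypothetical and only models the layout; it makes no claim about whether the hardware counts this as a conflict.

```python
def locate_16bit(thread_id, num_banks=32, bank_width=4, elem_size=2):
    """Return (bank, word index, byte offset inside the word) for one thread."""
    byte_address = thread_id * elem_size
    word = byte_address // bank_width
    return word % num_banks, word, byte_address % bank_width

placements = [locate_16bit(t) for t in range(32)]
banks_used = {bank for bank, _, _ in placements}

for t in range(4):
    print(t, placements[t])   # threads 0 and 1 share a word, as do 2 and 3
print(len(banks_used))        # number of banks touched by the whole warp
```

Threads 2k and 2k+1 fall in the same bank and also in the same 32-bit word, at byte offsets 0 and 2, so the warp touches only 16 banks. My question is whether two threads hitting different halves of one word count as "different addresses in the same bank".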
Question (2.1) - if my understanding is incorrect, can you please correct me and fill in the missing gaps in my understanding?
Question (2.2.a) - If my understanding is correct, this two-way bank conflict for 16-bit data can be avoided by padding each element with 16 more bits. Padding should avoid the bank conflicts, but does it have any repercussions?
Question (2.2.b) - If my understanding is correct, suppose we do not pad the 16-bit data. In this scenario, one row (128 bytes) of shared memory can hold 64 data elements of 16 bits each. Can threads from two warps then read these 64 16-bit variables in a single clock cycle? If so, padding the data to avoid bank conflicts would only reduce performance. Again, if my understanding is incorrect, can you please correct me and fill in the gaps?
Scenario 3 -
Assumption - the data being operated upon is 64 bits, each shared memory bank is 32 bits wide, and the physical shared memory is viewed as a two-dimensional array of M rows * 32 columns.
When 32 data variables of 64 bits each are stored in shared memory, each 64-bit variable occupies two consecutive 32-bit memory banks. Next, the 32 threads of a warp access these 32 variables, which together require two rows of shared memory. Hence, for one thread to access one 64-bit data item, it has to access two consecutive memory banks, each 32 bits wide. This will cause two-way bank conflicts, because two threads will be accessing two different addresses in the same bank when operating on 64-bit data.
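Here is a minimal Python sketch of the layout I have in mind, assuming thread t accesses the t-th 64-bit element of a tightly packed array that starts at byte address 0. The helper `banks_for_64bit` is hypothetical and only models placement, not how the hardware actually issues the request.

```python
def banks_for_64bit(thread_id, num_banks=32, bank_width=4, elem_size=8):
    """Return the (bank, row) pairs covered by one thread's 64-bit element."""
    first_word = thread_id * elem_size // bank_width
    words = elem_size // bank_width            # two 32-bit words per element
    return [((first_word + i) % num_banks, (first_word + i) // num_banks)
            for i in range(words)]

print(banks_for_64bit(0))    # first element: banks 0 and 1 of row 0
print(banks_for_64bit(15))   # last element that fits in row 0
print(banks_for_64bit(16))   # wraps to banks 0 and 1 again, but in row 1
```

Threads t and t + 16 cover the same pair of banks in different rows, which is why I expect a two-way conflict.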
A few people have suggested that this two-way bank conflict can be avoided by padding the 64-bit data with 32 more bits. The 64 bits plus 32 bits of padding give 96 bits (12 bytes) per element, which occupies three consecutive 32-bit banks in shared memory. This strategy helps to avoid the bank conflict, but with the padding one has to read 3 rows of shared memory (96 bits * 32 = 384 bytes = 128 bytes * 3).
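The arithmetic behind the padding suggestion can be sketched in Python as follows. This is only my own model: `first_word_banks` is a hypothetical helper that looks at the bank of the first 32-bit word of each thread's element, and it says nothing about how the hardware splits a 64-bit request.

```python
ROW_BYTES = 32 * 4   # 32 banks * 4 bytes per bank

def first_word_banks(stride_words, num_threads=32, num_banks=32):
    """Bank holding the first 32-bit word of each thread's element."""
    return [(t * stride_words) % num_banks for t in range(num_threads)]

distinct_unpadded = len(set(first_word_banks(2)))   # 64-bit elements, stride 2 words
distinct_padded = len(set(first_word_banks(3)))     # 96-bit padded, stride 3 words

rows_unpadded = 32 * 8 // ROW_BYTES     # 256 bytes of data
rows_padded = 32 * 12 // ROW_BYTES      # 384 bytes including padding

print(distinct_unpadded, rows_unpadded)
print(distinct_padded, rows_padded)
```

With a stride of 2 words, the first words of the 32 elements fall in only 16 distinct banks; with a stride of 3 words they fall in 32 distinct banks, because 3 and 32 share no common factor. The cost is one extra row of shared memory.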
Question - In reason (a), my thought process was that in one clock cycle the 32 threads of a warp read one row of shared memory (128 bytes = 32 banks * 32 bits per bank). So if we pad the 64-bit data to 96 bits, reading 3 rows will take 3 clock cycles but avoids bank conflicts. If we do not pad the 64-bit data to 96 bits, we only have to read 2 rows (256 bytes = 32 * 64 bits), which should take 2 clock cycles. I am not sure whether my understanding is correct. If I am wrong, can you please fill in the gaps and correct me?
Question - If I am right, then padding 64-bit data to 96 bits to avoid bank conflicts costs one extra clock cycle. In this scenario, should we accept the two-way conflicts, because padding avoids bank conflicts but takes 3 clock cycles instead of 2? Again, I am not sure whether my understanding is correct, and I would appreciate it if you could correct me if I am wrong.