Assumptions -
(a) For the sake of making the discussion easier, let us visualize the hardware piece of shared memory as a 2-dimensional array with M rows and 32 columns (32 banks), such that M >= 32 and each bank is 32 bits wide. So each row of shared memory is 32 bits * 32 = 4 bytes * 32 = 128 bytes.
(b) The data being operated on is 32 bits wide,
(c) we consider only the 32 threads of the first warp (warp_0),
(d) blocks are 1-dimensional,
(e) thread_id → represents a thread's ID. For example, thread_0 represents the thread with ID 0, thread_1 the thread with ID 1, and so on.
Scenario_A - Visualizing shared memory as a 2D array, shared_memory[M][32] (M rows and 32 columns). Threads thread_0, thread_1, and thread_3 to thread_31 of warp_0 access shared_memory[0][thread_id] respectively: thread_0 accesses shared_memory[0][0], thread_1 accesses shared_memory[0][1], thread_3 accesses shared_memory[0][3], and so on. Thread_2 of warp_0, however, accesses shared_memory[1][2]. So there are no bank conflicts, since all 32 threads of warp_0 access 32 different banks; apart from thread_2, all threads of the warp access contiguous, adjacent words of row 0, while thread_2's word lives in row 1 but still in its own bank (bank 2).
Question that I am trying to figure out → When it comes to performance, is Scenario_A similar in performance to a two-way bank conflict? While thinking about it, two thought processes come to mind, My_thought_process_A_1 and My_thought_process_A_2, described below.
My_thought_process_A_1 → The memory controller issues at most one 128-byte transaction (read or write) per clock cycle, and the 128 bytes read or written in a single transaction are adjacent and contiguous. Hence, when thread_0 requests a 32-bit READ (at shared_memory[0][0]) from shared memory at byte offset 0, the memory controller issues a contiguous 128-byte read transaction in that clock cycle; let us name this transaction transaction_1. In the current clock cycle, transaction_1 gets issued and satisfies the READ requests from thread_0, thread_1, and thread_3 to thread_31 of warp_0. But since thread_2 of warp_0 accesses shared_memory[1][2], and thread_2's READ request is not satisfied by transaction_1, the memory controller issues another READ transaction, transaction_2, in a different clock cycle, which reads another 128 contiguous bytes and satisfies thread_2's READ request.
Observation_A_1 → In terms of performance, does Scenario_A require 2 clock cycles to complete, similar to a two-way bank conflict? In a two-way bank conflict, accessing data from shared memory requires two clock cycles, since the memory controller has to issue two transactions in two different cycles. So by the reasoning in My_thought_process_A_1, even though Scenario_A has no bank conflicts, the memory controller has to issue two transactions in two different cycles, which matches the 2 clock cycles required when a two-way bank conflict occurs.
Question_A_1 → I might be totally wrong in My_thought_process_A_1 and Observation_A_1. If I am wrong above, I would appreciate it if you could correct me and fill in the gaps.
Also, it is mentioned in this Shared Memory blog → "Shared memory bandwidth is 32 bits per bank per clock cycle. Any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank".
My_thought_process_A_2 → So can I interpret the above statement as "if there are no bank conflicts, even non-contiguous accesses (reads or writes) to shared memory by the 32 threads of a warp can be processed in the same clock cycle"? Under this interpretation, can we assume that for Scenario_A the memory controller, in the background, issues an instruction that reads the 32 bits requested by thread_2 at shared_memory[1][2] in the same clock cycle as it reads the 32 bits each at shared_memory[0][thread_id], for thread_id in [0, ..., 31] excluding 2, 128 bytes in total?
Basically, does the memory controller, in one instruction and in the same clock cycle, read 128 bytes of data (row 0 of shared_memory minus its third element (shared_memory[0][2]), plus the third element of row 1 (shared_memory[1][2]))?
Observation_A_2 → If the above is true, then shared memory access in Scenario_A performs better than a two-way bank conflict: a two-way bank conflict requires 2 clock cycles to complete its transactions, while, if My_thought_process_A_2 is correct, Scenario_A requires only one clock cycle.
Question_A_2 → Again, I might be totally wrong in My_thought_process_A_2 and Observation_A_2. If I am wrong above, I would appreciate it if you could correct me and fill in the gaps.