Requesting clarification on shared memory bank conflicts and shared memory access


Assumption - 32-bit word size (data bus size), and each bank is 32 bits wide. The data to be operated upon is either 16, 32, or 64 bits.

To make the discussion easier, let us visualize the hardware piece of shared memory as a 2-dimensional array with M rows and 32 columns (32 banks), where M >= 32 and each bank is 32 bits wide. Each row of shared memory is then 32 bits * 32 = 4 bytes * 32 = 128 bytes.
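To illustrate this mental model, here is a minimal sketch (my own, with illustrative names) of how a byte offset would map to a bank and a row under these assumptions:

    // Hypothetical helpers mapping a shared memory byte offset onto the
    // 2D (row, bank) picture above: 32 banks, each 4 bytes wide, so one
    // row spans 128 bytes.
    __device__ unsigned bank_of(unsigned byte_offset) {
        return (byte_offset / 4) % 32;   // which 32-bit bank holds this word
    }
    __device__ unsigned row_of(unsigned byte_offset) {
        return byte_offset / 128;        // which 128-byte row it falls in
    }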

From what I have read in multiple resources, bank conflicts occur when multiple threads in a given warp access different address locations within the same bank. Suppose that for warp_0 all N (N=32) threads read or write a different address in the same bank; then we have an N-way bank conflict, and it was further mentioned that this results in N serialized requests. In the context of bank conflicts a few scenarios come to my mind for which I need some clarification. The scenarios follow below.

Scenario 1 - Why do N threads accessing different addresses in the same bank lead to serialized accesses?

A few possible reasons come to my mind. The following is my thought process.

Assumption - the data being operated on is 32 bits, so it fits exactly in a memory bank of width 32 bits.

Reason (a) - My thought process: in each clock cycle, a warp of 32 threads can access in parallel at most 32 different addresses in shared memory, and assuming for the sake of this discussion that those 32 addresses fall in 32 different banks, the threads of a warp can in a single clock cycle read only a single row (128 bytes = 32 banks * 32 bits per bank) of shared memory. Therefore, if the threads of a warp access different addresses in the same bank, that is equivalent to only a single thread of the warp being served per clock cycle. Hence the 32 threads of the warp are served over 32 clock cycles, which is equivalent to 32 serialized accesses.
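For instance, here is a sketch of the two extremes under this assumption (32-bit float elements; my example, and the shared array contents are left uninitialized since only the access pattern matters):

    __global__ void demo(float *out) {
        __shared__ float smem[32 * 32];
        // Conflict-free: thread i reads word i, so each thread hits a
        // different bank and the whole warp is served in one access.
        float a = smem[threadIdx.x];
        // 32-way conflict: consecutive threads read addresses 128 bytes
        // apart, so all 32 threads hit different words of bank 0 and
        // the accesses serialize.
        float b = smem[threadIdx.x * 32];
        out[threadIdx.x] = a + b;
    }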

Question - I am not sure whether the above thought process is correct. If my reasoning in reason (a) is wrong, can you please correct me and help me fill in the missing gaps?

Scenario 2 -
Assumption - the data being operated upon is 16 bits and the shared memory bank width is 32 bits.

When the 16-bit data is stored in shared memory, a 32-bit-wide memory bank will hold two data variables, each 16 bits in size. Two threads will therefore access two different 16-bit variables at two different addresses within the same 32-bit memory bank. This should lead to a two-way bank conflict, since bank conflicts are defined as different threads accessing different address locations within the same bank.
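To make the layout in question concrete, a sketch (mine; whether this pattern actually serializes is settled in the replies below):

    __global__ void load16(short *out) {
        __shared__ short smem16[64];   // 64 x 16-bit fills one 128-byte row
        // Threads 2k and 2k+1 read two different 16-bit addresses that
        // both fall inside the same 32-bit word of the same bank.
        out[threadIdx.x] = smem16[threadIdx.x];
    }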

Question (2.1) - If my understanding is incorrect, can you please correct me and fill in the missing gaps in my understanding?

Question (2.2.a) - Also, if my understanding is correct, then this 2-way bank conflict for 16-bit data can be avoided by padding each element with 16 more bits. Padding should avoid the bank conflicts, but does it have any repercussions?

Question (2.2.b) - Also, if my understanding is correct, suppose we do not pad the 16-bit data. In this scenario one row (128 bytes) of shared memory can hold 64 data elements, each 16 bits in size. Can the threads of two warps then read these 64 16-bit variables in a single clock cycle? If so, padding the data to avoid bank conflicts would only reduce performance. Again, if my understanding is incorrect, can you please correct me and fill in the missing gaps?

Scenario 3 -
Assumption - the data being operated upon is 64 bits, each shared memory bank is 32 bits wide, and shared memory is again viewed as a 2-dimensional array of size M rows * 32 columns.

When 32 data variables, each 64 bits in size, are stored in shared memory, each variable occupies two consecutive 32-bit memory banks. The 32 threads of a warp will then access 32 such variables, which require two rows of shared memory to store them. Hence, for one thread to access one 64-bit data item, it must access two consecutive 32-bit banks. This should cause a two-way bank conflict, because two threads will be accessing two different addresses in the same bank when operating on 64-bit data.

Now, a few have suggested that this 2-way bank conflict can be avoided if we pad the 64-bit data with 32 more bits. The 64 bits plus the additional 32 bits of padding make 96 bits (12 bytes) of data, which occupies three consecutive 32-bit banks in shared memory. This strategy avoids the bank conflict, but by introducing padding one has to read 3 rows (96 bits * 32 = 384 bytes = 3 * 128 bytes) of shared memory.

Question - In reason (a) I reasoned that in one clock cycle the threads of a warp (32 threads) read one row of shared memory (128 bytes = 32 banks * 32 bits per bank). So if we pad the 64-bit data to 96 bits, reading 3 rows will take 3 clock cycles but avoids bank conflicts. If instead we do not pad the 64-bit data, we only have to read 2 rows (256 bytes = 32 * 64 bits), which should be done in 2 clock cycles. I am not sure if my understanding is correct; if I am wrong can you please fill in the gaps and correct me?

Question - Also, if I am right, then padding the 64-bit data to 96 bits to avoid bank conflicts costs one extra clock cycle, and hence in this scenario we should be okay with the two-way conflict, because the unpadded access does the work in 2 clock cycles instead of 3. Again, I am not sure if my understanding is correct. I would appreciate it if you could correct me if I am wrong.


Robert will probably be along with a more complete reply, but here are a couple of things.

This statement needs amending:
“bank conflicts occur when multiple threads in a given warp access different address locations within the same bank.”

Quoting from the "Best Practices Guide":

“On devices of compute capability 5.x or newer, each bank has a bandwidth of 32 bits every clock cycle, and successive 32-bit words are assigned to successive banks.”

So two 16-bit elements packed into one 32-bit bank, with one element accessed by each of two successive threads, causes no conflict.


I wouldn’t try to go very far with this. It is a stated performance limit based on the hardware design. It would be like asking “why can’t I issue two instructions to the same functional unit in the same clock cycle?” Because the hardware is not designed in a way that supports that.

Shared memory hardware supports reading or writing one 32 bit quantity per bank, per cycle. If you want to go a bit farther, I believe the 2D arrangement suggested is a helpful mental model - there may be something about the hardware design that creates a 2D effect, and all the storage locations for a column (a bank) have some shared hardware, like an output buffer, or an addressing input, which can only take on one value per cycle. But that isn’t written anywhere that I know of. The basic idea is that shared memory can read or write one 32 bit quantity per bank, per cycle, and that is due to the hardware design, and there isn’t any more information about it than that. It’s not clear that any additional info is needed to properly educate the CUDA programmer in that respect.

We must keep in mind the broadcast rule for shared memory, which applies at least from cc3.0 forward (I think it applied to Fermi also). Two requests to the same bank and the same 32-bit location in that bank do not create a bank conflict. They invoke the broadcast rule, which says that the request for that bank can be handled in a single cycle, regardless of how many threads in the warp in that cycle are requesting data from that location. We also don’t require all threads to access the data at a particular granularity or offset within the location. One thread can request the first 16 bits, and another thread can request the next 16 bits in that 32-bit location in that bank, and the broadcast rule applies.

We also require more clarity than you have provided in the access pattern. To discuss that case, we need to know if, for a given bank, the two 16-bit quantities are adjacent to each other, in which case they invoke the broadcast rule, or not adjacent to each other, in which case they would 2-way serialize, creating a 2-way bank conflict, considering only those two thread requests in that warp, in that cycle. Let’s consider an example. Thread 0 in a warp requests a 16 bit quantity starting at address (byte offset) 0. Thread 1 in the warp requests a 16-bit quantity starting at address (byte offset) 2. These both belong to the first bank, and both belong to the first location in that bank. The broadcast rule says that Thread 0 and Thread 1 can retrieve their quantities in the same shared access cycle. No serialization, no bank conflicts, considering just those 2 threads. Another example: Thread 0 requests a 16 bit quantity at byte offset 0, thread 1 requests a 16-bit quantity at byte offset 130. Both of those pertain to bank 0, but they don’t pertain to the same location in bank 0. Therefore, that will create 2-way serialization/2-way bank conflict, considering just the needs of those 2 threads.
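As a sketch of the two 16-bit cases just described (my code, using the byte offsets from the examples above; the array size and kernel name are illustrative):

    __global__ void broadcast_vs_conflict(short *out) {
        __shared__ short s[2048];
        // Thread 0 reads s[0] (byte offset 0) and thread 1 reads s[1]
        // (byte offset 2): same 32-bit location in bank 0 -> broadcast
        // rule, no conflict between those two threads.
        short a = s[threadIdx.x];
        // Thread 0 reads s[0] and thread 1 reads s[65] (byte offset 130):
        // both bank 0 but different 32-bit locations -> 2-way
        // serialization, considering just those two threads.
        short b = s[threadIdx.x * 65];
        out[threadIdx.x] = a + b;
    }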

There is an additional piece of info needed to properly interpret this case. I will assume that the addresses being requested warp-wide are adjacent and contiguous. For the reasons you state, you might initially conclude that such an access will necessarily generate a 2-way bank conflict, when each thread needs 64 bits. However we don’t refer to this case as a bank conflict. In general, when the memory controller in a GPU encounters a request, considered warp-wide, that requires more than 128 bytes to satisfy, it will break such a request into 2 (or more) transactions. The memory controller never issues a single transaction to memory (global or shared) that is wider than 128 bytes. This is by HW design of the GPU. That’s how it works. I won’t be able to answer questions about “why is it that way?” Given that, however, the first 128 bytes of a 64-bit-per-thread load will be issued in the first transaction, and the second 128 bytes will be issued in a second transaction. Considering just these transactions, there are no bank conflicts for the case I outlined (adjacent/contiguous across the warp).
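A minimal sketch of the adjacent/contiguous 64-bit case described here (my example):

    __global__ void load64(double *out) {
        __shared__ double d[32];
        // Warp-wide this request is 32 x 8 = 256 bytes, so the memory
        // controller splits it into two 128-byte transactions. Within
        // each transaction every 32-bit word maps to a distinct bank,
        // so neither transaction has bank conflicts.
        out[threadIdx.x] = d[threadIdx.x];
    }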

You’ve brought up padding twice now. There shouldn’t be any need to pad data for these cases, and padding wouldn’t provide any benefits. The GPU designers have thought about a lot of different types of access for shared memory, and in many cases have provided mechanisms so that performance doesn’t fall over for typical/intended use cases.


@Robert_Crovella a big thanks for your detailed explanation. I have some follow up questions. I will appreciate if you can please help me with them.

Scenario_1 -

Assumption - The data element being operated upon is 16 bits; for a given bank the two 16-bit quantities are adjacent to each other; and shared memory is again viewed as a 2-dimensional array of M rows and 32 columns with M >= 32.

Question - 16 bits is 2 bytes, and one row of shared memory is 128 bytes, so 64 elements of 16 bits each can be stored contiguously, next to each other, in one row of shared memory. Since 32 bits can be accessed from a bank per cycle, two threads can concurrently access two 16-bit elements stored adjacent to each other in the same bank. So can all the threads of two different warps (64 threads) access the 64 elements (each 16 bits) in a single row of shared memory in one clock cycle? If this is possible, it should really speed up the entire process.

Scenario_2 -

Assumption - The data element being operated upon is 64 bits and every element is stored adjacent to the next; the addresses being requested warp-wide are adjacent and contiguous; and shared memory is again viewed as a 2-dimensional array of M rows and 32 columns with M >= 32.

Question - The 64-bit elements are stored adjacent to each other, so every 64-bit element occupies two adjacent banks. Thus 16 threads of a warp read one row (128 bytes, 16 elements) of shared memory in one clock cycle, meaning only 16 threads of the warp are active in that cycle while the other 16 are inactive. Can we do better somehow by making the other 16 inactive threads do something? Or is this the best we can do with 64-bit data, so that 16 threads of the warp remain inactive for a period of one clock cycle?

AFAIK, no. I don’t believe the requests from two separate warps can be combined into a single request for shared memory access. However, a similar or related question might be “what is the bandwidth of shared memory?” I mentioned that shared memory can serve one 32-bit quantity per bank per cycle, but I mostly had a single request in view there. I do not happen to know offhand whether shared memory can serve two separate requests (let’s say from two different warp schedulers/SMSPs) in the same cycle. If I wanted to answer that question, I would scour the various whitepapers and also high-quality microbenchmarking such as that done by the Citadel group, to see if I could find any clues. My suspicion, however, is that post-Kepler, the maximum bandwidth of shared memory is one 32-bit quantity per bank per cycle.

What did you have in mind by “do something”? Remember that all current CUDA GPUs issue the same instruction to all threads in a warp. If any thread is doing a shared read, they all are. This is true independent of convergence status. Even with warp divergence, even with the volta execution model, the statement is true, albeit some threads may be “disabled”/“inactive”. So what exactly do you mean by “do something”? If the first 16 threads are doing a shared read, can the other 16 threads do something that is not a shared read, in the same cycle? No, they cannot.


@Robert_Crovella Thank you, that answers my questions. I will dig through the whitepapers and the microbenchmarking by the Citadel group. I started with CUDA a couple of months ago, and I had all these questions in my mind but no one to discuss them with. You have been a great help. Thank you.

Thanks tavivekuh and Robert, your insightful questions and answers helped me a lot. I understand the case of 16-bit accesses to one bank. However, I have a further question regarding a scenario in which multiple 8-bit/16-bit accesses to shared memory occur. Suppose each thread accesses 8 bits of data at a time. If four threads access different locations within the same bank, this leads to a bank conflict and utilizes only 8/32 of the shared memory bandwidth. Why is there no support for more fine-grained access to shared memory? I am interested in understanding the hardware limitations behind 8-bit shared memory bandwidth.


The limitations can be directly inferred from the statements already provided. If two (or more) threads in a warp, for a single warp-issued instruction, access different 32-bit locations in the same bank, regardless of their access width (8-bit, 16-bit, or 32-bit), then that will result in a bank conflict.

However there is the broadcast rule. Two or more threads accessing the same location do not generate a bank conflict. This is also true regardless of the access width (8-bit, 16-bit, or 32-bit) and is even true if, for example, different 8-bit locations in the same 32-bit word are being accessed by different threads.
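A sketch of that 8-bit broadcast case (my example, assuming a char array):

    __global__ void load8(char *out) {
        __shared__ char c[32];
        // Threads 0..3 read the four different bytes of the same 32-bit
        // word in bank 0: different 8-bit locations within one 32-bit
        // location, so the broadcast rule applies and there is no bank
        // conflict among them.
        out[threadIdx.x] = c[threadIdx.x];
    }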

Coupled with that, as already stated, the maximum bandwidth of shared memory (I believe, post-Kepler) is one 32-bit quantity/location per bank, per cycle (per SM). Accessing 16-bit or 8-bit quantities per thread will therefore reduce the maximum achievable bandwidth to one-half (16-bit) or one-quarter (8-bit) of that figure.

There are also other chip-specific factors which may impact whether a full bandwidth of 128 bytes per access (32 threads in a warp times 32 bits per thread) can be achieved.

I’m not sure what you mean. 8-bit or 16-bit access is supported in a fashion similar to 32-bit, as already covered in my comments above. At some point, I cannot proceed any further with questions of the form “why is it this way?” I have given a behavioral description. Beyond that, I will eventually end up at the answer “because that is the way the GPU designers chose to design it”.


Thank you. More fine-grained access means 32-bit x 32 banks → 16-bit x 64 banks or 8-bit x 128 banks. Of course, the most common cases are definitely 16-bit and 32-bit access. And as you say, this is the GPU designers’ decision.

I had ignored some details of the hardware implementation of shared memory. Supporting more banks requires a more complex crossbar design, leading to extra overhead.

The logical model of shared memory as a 2D array runs into complications, as it misses the key point that each bank can access a different row. It may be better to think of it as 32 separate arrays (one per bank/column), each holding N 32-bit values (a 1-wide row). On each cycle each bank can read or write one of its N values. The three key elements of shared memory are:

  1. Conflict resolver - Each cycle the resolver determines the maximum set of threads that can read banks without a conflict.
  2. Banks - 32 independent 32-bit wide banks.
  3. Crossbar - Allows each warp lane to select any one of the 32 banks.

EXAMPLE : Load from Shared Memory

  1. The shared memory unit receives a load instruction from an SM sub-partition. The unit removes all inactive threads (lanes) and all threads (lanes) that are predicated off from further processing.
  2. If the instruction uses a generic address (LD, ST, ATOM), convert the 32-bit or 64-bit generic address to a shared memory offset by subtracting the generic shared memory window base address.
  3. For all valid lanes, check that the offset is within the valid shared memory size. If any address is out of bounds, throw a hardware exception.
  4. For all valid lanes, calculate the bank = (offset >> 2) & 0x1F
    offset[1:0] is the byte offset within the bank
    offset[6:2] is the bank #
    offset[n:7] is the bank row
  5. For all valid lanes, calculate the bank row = offset >> 7
  6. RESOLVER while (valid_lanes) // create wavefronts until all valid lanes have accessed shared memory
    // for 32-bit reads/writes this takes at worst 32 loop iterations

6a. From the lowest valid lane (thread) to the highest, assign banks:

    uint32_t valid_lanes                = STEP1
    uint32_t lane_bank[MAX_LANES]       = STEP4
    uint32_t lane_bank_row[MAX_LANES]   = STEP5

    // per-wavefront state: the row each bank will read this wavefront,
    // and the set of lanes each bank serves
    uint32_t bank_row[MAX_BANKS]       = {UNDEFINED}
    uint32_t bank_lane_mask[MAX_BANKS] = {0}
    for (i = 0; i < MAX_LANES; ++i)
    {
        if (!(valid_lanes & (1 << i)))
            continue;                       // lane already served by an earlier wavefront
        bank = lane_bank[i]
        row  = lane_bank_row[i]
        if (bank_row[bank] == UNDEFINED     // first lane to access bank
            || bank_row[bank] == row)       // !first lane but matching row (broadcast)
        {
            bank_row[bank] = row;
            bank_lane_mask[bank] |= 1 << i;
            // remove lane from valid lanes as it will be completed this wavefront
            valid_lanes &= ~(1 << i);
        }
    }

6b. BANKS Each bank reads 32 bits from its bank_row.
6c. CROSSBAR The shared memory crossbar is configured to output the bank data to the correct threads as defined by the bank_lane_mask.
6d. The per-thread data is written to a per-sub-partition FIFO. The FIFO data is written back to the sub-partition register file.
6e. When all data has reached the register file the instruction is marked as retired. This step also resolves any scoreboard-blocked dependent instructions.
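To see the resolver numerically, here is a small host-side sketch (my code, an illustration of steps 4-6a above, not NVIDIA's actual implementation; count_wavefronts is a hypothetical name) that counts the wavefronts needed for one warp of 32-bit accesses:

    #include <stdio.h>
    #include <stdint.h>

    // Count wavefronts for one warp of 32-bit accesses at the given
    // byte offsets, following steps 4-6a above (illustrative only).
    static int count_wavefronts(const uint32_t offset[32])
    {
        uint32_t valid_lanes = 0xFFFFFFFFu;
        int wavefronts = 0;
        while (valid_lanes) {
            uint32_t bank_row[32];
            int bank_used[32] = {0};
            for (int i = 0; i < 32; ++i) {
                if (!(valid_lanes & (1u << i))) continue;
                uint32_t bank = (offset[i] >> 2) & 0x1F;  // step 4
                uint32_t row  = offset[i] >> 7;           // step 5
                if (!bank_used[bank] || bank_row[bank] == row) {
                    bank_used[bank] = 1;                  // first lane to use bank,
                    bank_row[bank]  = row;                // or matching row (broadcast)
                    valid_lanes &= ~(1u << i);            // lane served this wavefront
                }
            }
            ++wavefronts;
        }
        return wavefronts;
    }

    int main(void)
    {
        uint32_t conflict_free[32], strided[32];
        for (uint32_t i = 0; i < 32; ++i) {
            conflict_free[i] = 4 * i;    // consecutive words: 1 wavefront
            strided[i]       = 128 * i;  // all in bank 0: 32 wavefronts
        }
        printf("%d %d\n", count_wavefronts(conflict_free),
                          count_wavefronts(strided));
        return 0;
    }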

The shared memory unit processes warp instructions.

Supporting …

  • More banks would increase the area and latency of both the resolver and the crossbar. The write and read paths are multiples of 32 bits (matching registers).
  • Multiple warps would greatly increase the complexity of the crossbar, as the instruction metadata for the sub-partition and register file location would need to be tracked per lane.
  • Multiple warps would greatly complicate the handling of store conflicts and atomics.
  • Multiple warps would greatly increase bank arbitration conflicts. On GV100+ the shared memory, load/store, and texture paths all use the same SRAM, so there is already bank arbitration among multiple clients.

All accesses are 32-bit; a 64-bit access is just 2 banks of 32 bits. The truncation and shift to an 8-bit or 16-bit value can be done in the write-back path.

