Requesting clarification for Shared Memory Bank Conflicts and Shared memory access?

Greg · January 23, 2024, 5:25am

The logical model of shared memory as a 2D array runs into complications because it misses the key point that each bank can access a different row. It may be better to think of it as 32 separate arrays (one per bank, i.e. one per column), each holding N 32-bit values (rows one word wide). On each cycle each bank can read or write one of its N values. The three key elements of shared memory are:

Conflict resolver - Each cycle the resolver determines the maximum set of threads that can read banks without a conflict.
Banks - 32 independent 32-bit wide banks.
Crossbar - Allows each warp lane to select any one of the 32 banks.
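
To make the "32 separate arrays" view concrete, here is a minimal host-side sketch in C++. It is only a mental model, not how the hardware is built: the names `BankedSharedMemory`, `bank_of`, `row_of`, and `word` are made up for illustration, and it assumes consecutive 32-bit words map to consecutive banks.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: 32 banks, each an independent array of 32-bit words.
constexpr uint32_t kNumBanks = 32;

struct BankedSharedMemory {
    // banks[b][r] is the 32-bit word stored in row r of bank b.
    std::array<std::vector<uint32_t>, kNumBanks> banks;

    explicit BankedSharedMemory(uint32_t rows_per_bank) {
        for (auto& bank : banks) bank.assign(rows_per_bank, 0u);
    }

    // Consecutive 32-bit words land in consecutive banks;
    // every 32 words the mapping wraps around and moves to the next row.
    static uint32_t bank_of(uint32_t word_index) { return word_index % kNumBanks; }
    static uint32_t row_of(uint32_t word_index)  { return word_index / kNumBanks; }

    // Access a word by its linear 32-bit word index.
    uint32_t& word(uint32_t word_index) {
        const uint32_t b = bank_of(word_index);
        const uint32_t r = row_of(word_index);
        assert(r < banks[b].size());  // stay inside the modeled bank
        return banks[b][r];
    }
};
```

In this model, word 1 and word 33 both live in bank 1 but in rows 0 and 1, so two lanes touching them in the same instruction need the same bank to read two different rows, which is exactly what a 2D-array picture tends to hide.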

EXAMPLE : Load from Shared Memory

1. The shared memory unit receives a load instruction from an SM sub-partition. The unit removes all inactive threads (lanes) and all threads (lanes) that are predicated off from further processing.
2. If the instruction uses a generic address (LD, ST, ATOM), convert the 32-bit or 64-bit generic address to a shared memory offset by subtracting the generic shared memory window base address.
3. For all valid lanes, check that the offset is within the valid shared memory size. If any address is out of bounds, throw a hardware exception.
4. For all valid lanes, calculate the bank = (offset >> 2) & 0x1F
   offset[1:0] is the byte offset in the bank
   offset[6:2] is the bank #
   offset[n:7] is the bank row
5. For all valid lanes, calculate the bank row = (offset & ~0x7F), i.e. offset[n:7]
6. RESOLVER while (valid_lanes) // create wavefronts until all valid lanes have accessed shared memory
   // for 32-bit read/write this will at worst be 32 loop iterations

6a. From the lowest valid lane (thread) to the highest, assign lanes to banks:

    uint32_t valid_lanes                = STEP1;  // bit i set = lane i still needs service
    uint32_t lane_bank[MAX_LANES]       = STEP4;
    uint32_t lane_bank_row[MAX_LANES]   = STEP5;

    // per-wavefront state, reset at the start of each while-loop iteration
    uint32_t bank_row[MAX_BANKS]        = {UNDEFINED};  // all entries start UNDEFINED
    uint32_t bank_lane_mask[MAX_BANKS]  = {0};          // lanes served by each bank
    for (i = 0; i < MAX_LANES; ++i)
    {
        if (!(valid_lanes & (1u << i)))
            continue;                       // lane inactive or already serviced
        bank = lane_bank[i];
        row  = lane_bank_row[i];
        if (bank_row[bank] == UNDEFINED     // first lane to access bank
            || bank_row[bank] == row)       // !first lane but matching row
        {
            bank_row[bank] = row;
            bank_lane_mask[bank] |= 1u << i;
            // remove lane from valid lanes as it will be completed in this wavefront
            valid_lanes &= ~(1u << i);
        }
    }

6b. BANKS Each bank reads 32 bits from its selected bank_row.
6c. CROSSBAR The shared memory crossbar is configured to route each bank's data to the threads identified by that bank's bank_lane_mask.
6d. The per thread data is written to a per sub-partition FIFO. The FIFO data is written back to the sub-partition register file.
6e. When all data has reached the register file, the instruction is marked as retired. This step also releases any scoreboard entries that are blocking dependent instructions.
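
The resolver loop above can be tried out on the host. The following is a rough C++ sketch of steps 4 through 6a for a 32-bit access, taking the bank as offset[6:2] and the bank row as offset[n:7]. It only counts wavefronts; it does not model the crossbar, FIFO, or write-back. `count_wavefronts`, `strided_offsets`, and the `k...` constants are hypothetical names chosen for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kMaxLanes = 32;
constexpr int kMaxBanks = 32;
constexpr uint32_t kUndefined = 0xFFFFFFFFu;  // sentinel: bank not claimed yet

// Returns the number of wavefronts needed to service one 32-bit warp access.
// offsets[i] is lane i's byte offset into shared memory; bit i of valid_lanes
// is set when lane i is active and not predicated off.
int count_wavefronts(const std::array<uint32_t, kMaxLanes>& offsets,
                     uint32_t valid_lanes) {
    int wavefronts = 0;
    while (valid_lanes != 0) {
        // Per-wavefront state: which row each bank has been asked to read.
        std::array<uint32_t, kMaxBanks> bank_row;
        bank_row.fill(kUndefined);
        for (int i = 0; i < kMaxLanes; ++i) {
            const uint32_t lane_bit = 1u << i;
            if ((valid_lanes & lane_bit) == 0) continue;      // nothing to do
            const uint32_t bank = (offsets[i] >> 2) & 0x1Fu;  // offset[6:2]
            const uint32_t row  = offsets[i] >> 7;            // offset[n:7]
            if (bank_row[bank] == kUndefined || bank_row[bank] == row) {
                bank_row[bank] = row;      // claim the bank, or share its row
                valid_lanes &= ~lane_bit;  // lane completes in this wavefront
            }
        }
        ++wavefronts;
    }
    assert(wavefronts <= kMaxLanes);  // worst case: one lane per wavefront
    return wavefronts;
}

// Hypothetical helper: lane i accesses the 32-bit word at index i * stride_words.
std::array<uint32_t, kMaxLanes> strided_offsets(uint32_t stride_words) {
    std::array<uint32_t, kMaxLanes> offsets{};
    for (int i = 0; i < kMaxLanes; ++i) {
        offsets[i] = static_cast<uint32_t>(i) * stride_words * 4u;
    }
    return offsets;
}
```

With all 32 lanes valid, a stride of 1 word resolves in 1 wavefront, a stride of 2 takes 2, and a stride of 32 takes 32 because every lane hits bank 0 on a different row. A stride of 0 (every lane reads the same word) also takes 1 wavefront, since the lanes match on both bank and row.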

The shared memory unit processes warp instructions.

Supporting …

More banks would increase the area and latency of both the resolver and the crossbar. The write and read paths are multiples of 32 bits wide (matching registers).
Multiple warps would greatly increase the complexity of the crossbar, as the instruction metadata for sub-partition and RF location would need to be per lane.
Multiple warps would greatly complicate the handling of store conflicts and atomics.
Multiple warps would greatly increase bank arbitration conflicts. On GV100+ the shared memory, load/store, and texture path all use the same SRAM, so there is already bank arbitration between multiple clients.

All accesses are 32-bit. A 64-bit access is just two 32-bit banks. The truncation and shift to an 8-bit or 16-bit value can be done in the write-back path.
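
As an illustration of that last point, here is a small C++ sketch of what the shift-and-truncate step and the two-bank 64-bit case might look like in software terms. The function names `extract_subword` and `combine_64` are hypothetical, and the sketch assumes little-endian byte order within a word and naturally aligned accesses; it is not a description of the actual write-back logic.

```cpp
#include <cassert>
#include <cstdint>

// Pulls an 8-, 16-, or 32-bit value out of the 32-bit word a bank returned.
// byte_in_word is offset[1:0]; size_bytes is 1, 2, or 4.
// Assumption: little-endian byte order and natural alignment.
uint32_t extract_subword(uint32_t bank_word, uint32_t byte_in_word,
                         uint32_t size_bytes) {
    assert(size_bytes == 1 || size_bytes == 2 || size_bytes == 4);
    assert(byte_in_word < 4 && byte_in_word % size_bytes == 0);
    const uint32_t shifted = bank_word >> (8u * byte_in_word);  // shift
    const uint32_t mask = (size_bytes == 4)
                              ? 0xFFFFFFFFu
                              : ((1u << (8u * size_bytes)) - 1u);
    return shifted & mask;                                      // truncate
}

// A 64-bit access reads two adjacent 32-bit banks; this joins the two words.
uint64_t combine_64(uint32_t low_bank_word, uint32_t high_bank_word) {
    return (static_cast<uint64_t>(high_bank_word) << 32) | low_bank_word;
}
```

For example, with a bank word of 0xAABBCCDD, a byte load at offset[1:0] = 3 yields 0xAA and a 16-bit load at offset[1:0] = 2 yields 0xAABB. The bank itself always delivers the full 32-bit word.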
