The logical model of shared memory as a 2D array runs into complications because it misses the key point that each bank can access a different row. It may be better to think of it as 32 separate arrays (one per bank, i.e. one per column), each holding N 32-bit values (each row is one word wide). On each cycle each bank can read or write one of its N values. The three key elements of shared memory are:
- Conflict resolver - Each cycle the resolver determines the maximum set of threads that can read banks without a conflict.
- Banks - 32 independent 32-bit wide banks.
- Crossbar - Allows each warp lane to select any one of the 32 banks.
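A minimal sketch of this mental model follows. It is illustrative only and says nothing about how the hardware is built; `SharedMemoryModel`, `read_cycle`, `demo_read`, and `kNoAccess` are hypothetical names invented for the example. The point it shows is that each bank is its own array and can be asked for a different row in the same cycle.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kBanks = 32;            // 32 independent banks (from the text)
constexpr uint32_t kNoAccess = ~0u;   // hypothetical marker: bank idle this cycle

// Hypothetical model: one array of N 32-bit words per bank.
struct SharedMemoryModel {
    std::array<std::vector<uint32_t>, kBanks> bank;

    explicit SharedMemoryModel(uint32_t rows) {
        for (auto& b : bank) b.assign(rows, 0);
    }

    // One modeled cycle: every bank may read a different row, or stay idle.
    std::array<uint32_t, kBanks> read_cycle(
        const std::array<uint32_t, kBanks>& row) const {
        std::array<uint32_t, kBanks> out{};
        for (int b = 0; b < kBanks; ++b) {
            if (row[b] == kNoAccess) continue;
            assert(row[b] < bank[b].size());
            out[b] = bank[b][row[b]];
        }
        return out;
    }
};

// Fills the model so that word w of the flat 32-bit view holds the value w,
// then reads a different row from each bank in a single modeled cycle.
std::array<uint32_t, kBanks> demo_read() {
    constexpr uint32_t kRows = 64;
    SharedMemoryModel smem(kRows);
    for (uint32_t w = 0; w < kRows * kBanks; ++w)
        smem.bank[w % kBanks][w / kBanks] = w;

    std::array<uint32_t, kBanks> rows{};
    for (int b = 0; b < kBanks; ++b)
        rows[b] = static_cast<uint32_t>(b);   // bank b reads row b
    return smem.read_cycle(rows);
}
```

In `demo_read`, bank b returns word `b * 32 + b` of the flat view, which a single-row 2D-array picture could not deliver in one access.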
EXAMPLE: Load from Shared Memory
1. The shared memory unit receives a load instruction from an SM sub-partition. The unit removes all inactive threads (lanes) and all threads (lanes) that are predicated off from further processing.
2. If the instruction uses a generic address (LD, ST, ATOM), convert the 32-bit or 64-bit generic address to a shared memory offset by subtracting the base address of the generic shared memory window.
3. For all valid lanes, check that the offset is within the valid shared memory size. If any address is out of bounds, throw a hardware exception.
4. For all valid lanes, calculate the bank = (offset >> 2) & 0x1F
offset[1:0] is the byte offset in the bank
offset[6:2] is the bank #
offset[n:7] is the bank row
5. For all valid lanes, calculate the bank row = offset >> 7
6. RESOLVER while (valid_lanes) // create wavefronts until all valid lanes have accessed shared memory
// for 32-bit read/write this will at worst be 32 loop iterations
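6a. From the lowest valid lane (thread) to the highest, assign lanes to banks
    // inputs from the earlier steps, computed once before the while loop
    uint32_t valid_lanes = STEP1
    uint32_t lane_bank[MAX_LANES] = STEP4
    uint32_t lane_bank_row[MAX_LANES] = STEP5
    // per-wavefront state, reset on every iteration of the while loop
    uint32_t bank_row[MAX_BANKS] = {UNDEFINED}   // every entry starts UNDEFINED
    uint32_t bank_lane_mask[MAX_BANKS] = {0}
    for (i = 0; i < MAX_LANES; ++i)
    {
        if (!(valid_lanes & (1u << i)))
            continue // lane is inactive or was served by an earlier wavefront
        bank = lane_bank[i]
        row = lane_bank_row[i]
        if (bank_row[bank] == UNDEFINED // first lane to access bank
            || bank_row[bank] == row)   // not the first lane, but matching row
        {
            bank_row[bank] = row;
            bank_lane_mask[bank] |= 1u << i;
            // remove lane from valid lanes as it will be completed in this wavefront
            valid_lanes &= ~(1u << i);
        }
        // otherwise the lane conflicts and stays valid for a later wavefront
    }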
6b. BANKS Each bank reads 32 bits from its bank_row.
6c. CROSSBAR The shared memory crossbar is configured to route each bank's data to the threads recorded in that bank's bank_lane_mask.
6d. The per thread data is written to a per sub-partition FIFO. The FIFO data is written back to the sub-partition register file.
6e. When all data has reached the register file the instruction is marked as retired. This step will also resolve any scoreboard blocking dependent instructions.
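The bank and bank row calculation and the resolver loop above can be turned into a small host-side simulation that counts wavefronts. This is a sketch of the algorithm as described here, not of the actual hardware: it takes the bank row as offset[n:7] (that is, `offset >> 7`), models 32-bit loads only, and the names `count_wavefronts`, `wavefronts_for_stride`, and `kUndefined` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kLanes = 32;             // lanes per warp
constexpr int kBanks = 32;             // 32-bit wide banks
constexpr uint32_t kUndefined = ~0u;   // hypothetical "no row chosen yet" marker

// Split a shared memory byte offset into bank (offset[6:2]) and row (offset[n:7]).
inline uint32_t bank_of(uint32_t offset)     { return (offset >> 2) & 0x1F; }
inline uint32_t bank_row_of(uint32_t offset) { return offset >> 7; }

// Returns the number of wavefronts needed for one 32-bit load instruction.
// Bit i of valid_lanes is set when lane i is active; offset[i] is its byte offset.
int count_wavefronts(uint32_t valid_lanes, const uint32_t (&offset)[kLanes]) {
    int wavefronts = 0;
    while (valid_lanes) {
        uint32_t bank_row[kBanks];
        for (uint32_t& r : bank_row) r = kUndefined;   // reset per wavefront

        for (int i = 0; i < kLanes; ++i) {
            if (!(valid_lanes & (1u << i))) continue;   // inactive or already served
            const uint32_t bank = bank_of(offset[i]);
            const uint32_t row  = bank_row_of(offset[i]);
            if (bank_row[bank] == kUndefined || bank_row[bank] == row) {
                bank_row[bank] = row;                   // bank serves this row
                valid_lanes &= ~(1u << i);              // lane completes now
            }
            // otherwise: bank conflict, lane waits for a later wavefront
        }
        ++wavefronts;
    }
    assert(wavefronts <= kLanes);   // at worst one wavefront per lane
    return wavefronts;
}

// Helper for experiments: lane i reads the 32-bit word at index i * stride_words.
int wavefronts_for_stride(uint32_t stride_words) {
    uint32_t offset[kLanes];
    for (int i = 0; i < kLanes; ++i) offset[i] = i * stride_words * 4u;
    return count_wavefronts(0xFFFFFFFFu, offset);
}
```

With this model a stride of 1 word needs one wavefront, a stride of 2 words needs two, and a stride of 32 words puts every lane in bank 0 with a different row and needs 32. A stride of 0 (all lanes reading the same word) also needs only one, because lanes that match the bank's chosen row are served together.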
The shared memory unit processes warp instructions.
Supporting …
- More banks would increase the area and latency of both the resolver and the crossbar. The write and read paths are multiples of 32 bits wide (matching the registers).
- Multiple warps would greatly increase the complexity of the crossbar, as the instruction metadata for sub-partition and RF location would need to be per lane.
- Multiple warps would greatly complicate the implementation of store conflicts and atomics.
- Multiple warps will greatly increase bank arbitration conflicts. On GV100+ the shared memory, load/store, and texture path all use the same SRAM so there is already bank arbitration by multiple clients.
All accesses are 32-bit. A 64-bit access is just two 32-bit banks. The truncation and shift to an 8-bit or 16-bit value can be done in the write-back path.
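As a rough illustration of that last point, the sketch below shows how a full 32-bit bank word could be narrowed to an 8-bit or 16-bit value, and which two banks a 64-bit access would touch. The helper names `extract_subword`, `banks_for_64bit`, and `BankPair` are hypothetical, and the code assumes naturally aligned accesses, little-endian byte order within the word, and zero extension of the narrowed value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical write-back helper: the bank always returns a full 32-bit word.
// offset[1:0] selects the byte within it; size_bytes is 1, 2, or 4.
uint32_t extract_subword(uint32_t bank_word, uint32_t offset, uint32_t size_bytes) {
    assert(size_bytes == 1 || size_bytes == 2 || size_bytes == 4);
    assert(offset % size_bytes == 0);               // naturally aligned (assumption)
    const uint32_t shift = (offset & 0x3) * 8;      // byte offset -> bit shift
    const uint32_t mask  = size_bytes == 4
                               ? 0xFFFFFFFFu
                               : (1u << (size_bytes * 8)) - 1u;
    return (bank_word >> shift) & mask;             // shift, then truncate
}

// A naturally aligned 64-bit access touches two adjacent banks in the same row.
struct BankPair {
    uint32_t lo_bank;
    uint32_t hi_bank;
    uint32_t row;
};

BankPair banks_for_64bit(uint32_t offset) {
    assert(offset % 8 == 0);                        // 8-byte aligned (assumption)
    const uint32_t lo = (offset >> 2) & 0x1F;       // always even when aligned
    return {lo, lo + 1, offset >> 7};
}
```

Because an aligned 64-bit offset always starts at an even bank, the pair never wraps from bank 31 into the next row.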