coalesced access and hardware Load/Store units

I have a few questions about memory access inside a streaming multiprocessor (SM).

As far as I understand:

  • Load and store instructions are handled by the SM's LD/ST units, which operate on L1 cache data, and each LD/ST unit handles the memory operation of a single thread.

  • LD/ST units work on chunks of data, e.g. 32 bytes per transaction.

Is this correct? And if so, when threads inside a warp read/write neighboring locations (coalesced access), do all of these reads/writes get assigned to a single LD/ST unit, given that they all fall within the address range of one chunk?

I mean, do we care about coalesced memory access in order to utilize these LD/ST units, or to avoid the cache-miss penalty?
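To make the question concrete, here is a minimal sketch of the access pattern I mean (the kernel and names are made up just for illustration):

```
// Each thread of a warp loads one 4-byte int from consecutive addresses,
// so a full warp touches a contiguous 128-byte range (four 32-byte chunks).
__global__ void coalescedCopy(const int* __restrict__ in, int* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // thread k of the warp reads in[base + k]
}
```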

  1. NVIDIA GPUs are 32-wide SIMD processors. This means that each register contains 32 4-byte entities, and each operation is performed on all 32 of these entities simultaneously. The so-called warp is the real thread of GPU execution.

  2. When the LD/ST engine gets a memory access request, it needs to process 32 loads (or stores) in a single operation. It splits those 32 addresses into groups, each covering a 32-byte memory block, and each group is executed in a single cycle. So coalescing memory accesses makes memory operations faster (see the sketch after this list).

  3. Caching is also done in 32-byte blocks.
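As a rough illustration only (a sketch, not exact hardware behaviour), compare how many 32-byte blocks one warp touches in a coalesced versus a strided pattern:

```
// Hypothetical kernels, only to show how a warp's 32 addresses map onto
// 32-byte memory blocks.
__global__ void coalesced(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];       // 32 threads * 4 bytes = 128 contiguous bytes -> 4 blocks
}

__global__ void strided(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * 8];   // addresses are 32 bytes apart: each thread lands in its
                          // own 32-byte block -> 32 blocks for the same warp
}
```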

@BulatZiganshin

Thanks so much for your answer.
So suppose 8 threads from a warp are about to execute a load instruction, each loading a 4-byte integer from neighboring addresses, and we have two LD/ST units that are available, i.e. they have finished executing the groups of 32-byte transactions assigned to them.

Can we say that all 8 of these load instructions will be assigned to those two idle LD/ST units, and that the 8 operations will be executed in one cycle?

Let’s see. The so-called SM is similar to a module in various CPUs: it includes 2-4 cores and some common resources (L1 cache, shared memory…).

Each core (called a “dispatcher” in NVIDIA parlance) has a number of execution units. For example, on Maxwell these units include ALU, LD/ST, SFU, branch, and a few more. There is only one unit of each type, though!

Each unit is SIMD. The ALU unit is 32-wide, i.e. it can execute an operation on all 32 register elements in a single cycle.

The LD/ST unit is 8-wide, so in a single cycle it can load or store 8 register elements, requiring 4 cycles to process the entire register. Also, in a single cycle it can process only one 32-byte memory block, so if an operation requires more than 4 such blocks, it will spend more than 4 cycles.
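For example, assuming the 8-wide unit and 32-byte blocks described above (back-of-the-envelope arithmetic only, the array names are made up):

```
// Rough per-warp throughput estimate under the assumptions above.
__global__ void loads(const float* in_f, const double* in_d, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float  a = in_f[i];  // 32 threads * 4 B = 128 B = 4 blocks  -> ~4 cycles
    double b = in_d[i];  // 32 threads * 8 B = 256 B = 8 blocks  -> ~8 cycles
    out[i] = a + (float)b;
}
```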

This is about throughput; the operation latency is much longer (dozens of cycles), but OTOH the GPU doesn’t wait for an LD/ST operation to finish until it reaches an instruction that uses the data loaded into the register.
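A rough sketch of what that latency hiding looks like (illustration only; the compiler and scheduler details may differ):

```
// The load is issued, independent arithmetic keeps the warp busy, and the warp
// only stalls if 'x' is still in flight when the dependent add is reached.
__global__ void hideLatency(const float* in, float* out, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];       // long-latency load is issued here
    float y = a * b + a;   // independent work, does not wait for the load
    out[i] = x + y;        // first instruction that actually needs 'x'
}
```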

And of course, the rules for shared memory are different.


Thanks so much for this great answer. It is clear to me now.