coalesced access and hardware Load/Store units

I have a few questions related to memory access inside a streaming multiprocessor (SM).

As far as I understand:

  • Load and store instructions are handled by the SM's LD/ST units, which operate on L1 cache data, and each LD/ST unit handles the memory operation of a single thread.

  • LD/ST units work on chunks of data, e.g. 32 bytes per transaction.

Is this correct? And if so, when the threads inside a warp read/write neighboring locations (coalesced access), do all these reads/writes get assigned to a single LD/ST unit, given that they all fall within the address range of one chunk?

I mean, do we care about coalesced memory access in order to utilize these LD/ST units, or to avoid the cache miss penalty?

  1. NVidia GPUs are 32-wide SIMD processors. This means that each register contains 32 4-byte entities, and each operation is performed on all 32 of these entities simultaneously. The so-called warp is the real thread of GPU execution.

  2. When the LD/ST engine gets a memory access request, it needs to process 32 loads (or stores) in a single operation. It splits those 32 addresses into groups covering 32-byte memory blocks, and each group is executed in a single cycle. So, coalescing memory accesses makes memory operations faster (see the sketch after this list).

  3. Caching is also done in 32-byte blocks.
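
To illustrate point 2, here is a minimal sketch; the kernel and parameter names are my own, not from the thread. In the first kernel, a warp's 32 4-byte loads fall into just 4 aligned 32-byte blocks, while in the second a large stride can spread them over up to 32 different blocks:

```cuda
// Illustrative only: names are made up for this sketch.
// Coalesced: thread i of a warp accesses in[i], so the warp's 32 4-byte loads
// cover 128 contiguous bytes = 4 aligned 32-byte blocks.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];               // neighboring threads -> neighboring addresses
}

// Strided: with a stride of 8 floats (32 bytes) or more, every thread of the warp
// lands in a different 32-byte block, so one warp-wide load touches up to 32 blocks.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```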

@BulatZiganshin

Thanks so much for your answer.
So suppose we have 8 threads from a warp that are about to execute a load instruction, each of them loading a 4-byte integer from neighboring addresses, and we have two LD/ST units that are available, i.e. they have finished executing the group of 32-byte transactions assigned to them.

Can we say that all these 8 load instructions will be assigned to those two idle LD/ST units, and that the 8 operations will be executed in one cycle?

Let's see. The so-called SM is similar to a module in various CPUs: it includes 2-4 cores and some common resources (L1 cache, shared memory…).

Each core (called a "dispatcher" in NVidia lingo) has a number of execution units. For example, on Maxwell these units include ALU, LD/ST, SFU, branching and a few more. There is only one unit of each type, though!

Each unit is SIMD. The ALU unit is 32-wide, i.e. it can execute an operation on all 32 register elements in a single cycle.

The LD/ST unit is 8-wide. So, in a single cycle it can load or store 8 register elements, requiring 4 cycles to process the entire register. Also, in a single cycle it can process only one 32-byte memory block, so if an operation requires more than 4 such blocks, it will spend more than 4 cycles.
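
To put rough numbers on that, under the assumptions above (8-wide LD/ST unit, 32-byte blocks): a fully coalesced warp-wide load of 4-byte values covers 32 × 4 = 128 bytes, i.e. exactly 4 aligned 32-byte blocks, so it fits the 4-cycle minimum. In the worst case, where every thread of the warp touches a different 32-byte block, the same instruction has to cover 32 blocks and takes roughly 8 times as many cycles.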

This is about throughput; the operation latency is much longer (dozens of cycles), but OTOH the GPU doesn't wait for an LD/ST operation to finish until a later instruction actually uses the data loaded into the register.
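
Here is a minimal sketch of that latency hiding, again with made-up names: the warp only stalls at the first instruction that consumes the loaded register, so independent work issued in between overlaps with the load.

```cuda
// Illustrative only: names are made up for this sketch.
__global__ void scale(const float *in, float *out, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];     // load issued; the warp does not stall here
    float c = a * b;     // independent arithmetic overlaps with the load in flight
    out[i] = x * c;      // first use of x: only here may the warp have to wait
}
```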

And of course, the rules for shared memory are different.

Thanks so much for this great answer. It is all clear to me now.