I have a few questions related to memory access inside a streaming multiprocessor (SM).
As far as I understand:
- Load and store instructions are handled by the SM's LD/ST units, which operate on L1 cache data, and each LD/ST unit handles the memory operation of a single thread.
- LD/ST units work on chunks of data, e.g. 32 bytes per transaction.
Is this correct? And if so, when threads inside a warp read/write neighboring locations (coalesced access), do all of these reads/writes get assigned to a single LD/ST unit, given that they all fall within the address range of one chunk?
I mean, do we care about coalesced memory access in order to utilize these LD/ST units, or to avoid the cache miss penalty?
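For concreteness, here is the kind of access pattern I'm asking about (the kernels and names below are just made-up sketches, not real code from my project):

```
// Neighboring threads touch neighboring 4-byte words: each warp reads
// one contiguous 128-byte span.
__global__ void coalescedRead(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// With a large stride, each thread's word lands in a different memory
// block, so the warp's loads are spread over many chunks.
__global__ void stridedRead(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];
}
```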
@BulatZiganshin
Thanks so much for your answer.
So, say we have 8 threads from a warp that are about to execute a load instruction, each of them loading a 4-byte integer from neighboring addresses, and we have two LD/ST units that are available, i.e. they have finished executing the group of 32-byte transactions assigned to them.
Can we say that all 8 load instructions will be assigned to these two idle LD/ST units, and that the 8 operations will be executed in one cycle?
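Something like this minimal sketch is what I have in mind (the kernel and buffer names are made up, and I'm assuming only the first 8 lanes of the warp are active):

```
// Hypothetical illustration of the question: 8 active lanes, each loading
// one 4-byte int from consecutive addresses, i.e. 32 bytes in total,
// which is exactly one 32-byte block.
__global__ void eightLaneLoad(const int* data, int* out)
{
    int lane = threadIdx.x;      // assume blockDim.x == 32, a single warp
    if (lane < 8) {
        out[lane] = data[lane];  // the load in question
    }
}
```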
Let’s see. The so-called SM is similar to a module in various CPUs: it includes 2-4 cores and some common resources (L1 cache, shared memory, and so on).
Each core (called a “dispatcher” in NVIDIA lingo) has a number of execution units. For example, on Maxwell these units include ALU, LD/ST, SFU, branch, and a few more. There is only one unit of each type, though!
Each unit is SIMD. The ALU unit is 32-wide, i.e. it can execute an operation on all 32 register elements in a single cycle.
The LD/ST unit is 8-wide, so in a single cycle it can load or store 8 register elements, requiring 4 cycles to process the entire register. Also, in a single cycle it can access only one 32-byte memory block, so if an operation requires more than 4 such blocks, it will take more than 4 cycles.
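To make the throughput arithmetic concrete, here are the numbers for a fully coalesced warp-wide load of 4-byte values under this model (my own worked example, not measured figures):

```
//   32 threads * 4 bytes            = 128 bytes requested
//   128 bytes / 32 bytes per block  =   4 memory blocks
//   32 register elements / 8 lanes  =   4 issue cycles
// Both limits give 4 cycles of LD/ST throughput for the warp. A scattered
// access touching, say, 32 distinct 32-byte blocks would instead be
// limited to roughly 32 cycles by the block count.
```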
This is about throughput; the operation latency is much longer (dozens of cycles), but OTOH the GPU doesn’t wait for an LD/ST operation to finish until an instruction actually uses the data loaded into the register.
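For instance, in a kernel like this (names made up, just a sketch), the scheduler can keep issuing the independent arithmetic while the load is still in flight, and the warp only has to wait where the loaded value is first consumed:

```
__global__ void hideLoadLatency(const int* in, int* out, int bias)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    int v = in[i];            // load issued; result not needed yet

    int x = bias * bias + 3;  // independent work, overlaps the load
    x = x * x + i;

    out[i] = v + x;           // first use of 'v': possible stall point
}
```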
And of course, the rules are different for shared memory.
Thanks so much for this great answer. It is clear to me now.