coalesced access and hardware Load/Store units

I have a few questions related to memory access inside a streaming multiprocessor (SM).

As far as I understand:

  • Load and store instructions are handled by the SM's LD/ST units, which operate on L1 cache data, and each LD/ST unit handles the memory operation of a single thread.

  • LD/ST units work on chunks of data, e.g. 32 bytes per transaction.

Is this correct? And if so, when the threads inside a warp read/write neighboring locations (coalesced access), do all these reads/writes get assigned to a single LD/ST unit, given that they all fall within the address range of one chunk?

I mean, do we care about coalesced memory access in order to utilize these LD/ST units, or to avoid the cache miss penalty?

  1. NVidia GPUs are 32-wide SIMD processors. This means that each register contains 32 4-byte entities, and each operation is performed on all 32 of these entities simultaneously. The so-called warp is the real thread of GPU execution.

  2. When the LD/ST engine gets a memory access request, it needs to process 32 loads (or stores) in a single operation. It splits those 32 addresses into groups covering 32-byte memory blocks, and each group is executed in a single cycle. So, coalescing memory accesses makes memory operations faster (see the sketch after this list).

  3. Caching is also done in 32-byte blocks.
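
To illustrate point 2, here is a minimal sketch; the kernel and parameter names are my own, not from the thread. In the first kernel, a warp's 32 4-byte loads fall into just 4 aligned 32-byte blocks, while in the second a large stride can spread them over up to 32 different blocks:

```cuda
// Illustrative only: names are made up for this sketch.
// Coalesced: thread i of a warp accesses in[i], so the warp's 32 4-byte loads
// cover 128 contiguous bytes = 4 aligned 32-byte blocks.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];               // neighboring threads -> neighboring addresses
}

// Strided: with a stride of 8 floats (32 bytes) or more, every thread of the warp
// lands in a different 32-byte block, so one warp-wide load touches up to 32 blocks.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```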

@BulatZiganshin

Thanks so much for your answer.
So suppose we have 8 threads from a warp that are about to execute a load instruction, each of them loading a 4-byte integer from neighboring addresses, and we have two LD/ST units that are available, i.e. they have finished executing the group of 32-byte transactions assigned to them.

Can we say that all these 8 load instructions will be assigned to those two idle LD/ST units, and that the 8 operations will be executed in one cycle?

Let's see. The so-called SM is similar to a module in various CPUs: it includes 2-4 cores and some common resources (L1 cache, shared memory…).

Each core (called a "dispatcher" in NVidia lingo) has a number of execution units. For example, on Maxwell these units include ALU, LD/ST, SFU, branching and a few more. There is only one unit of each type, though!

Each unit is SIMD. The ALU unit is 32-wide, i.e. it can execute an operation on all 32 register elements in a single cycle.

The LD/ST unit is 8-wide. So, in a single cycle it can load or store 8 register elements, requiring 4 cycles to process the entire register. Also, in a single cycle it can process only one 32-byte memory block, so if an operation requires more than 4 such blocks, it will spend more than 4 cycles.
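
To put rough numbers on that, under the assumptions above (8-wide LD/ST unit, 32-byte blocks): a fully coalesced warp-wide load of 4-byte values covers 32 × 4 = 128 bytes, i.e. exactly 4 aligned 32-byte blocks, so it fits the 4-cycle minimum. In the worst case, where every thread of the warp touches a different 32-byte block, the same instruction has to cover 32 blocks and takes roughly 8 times as many cycles.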

This is about throughput; the operation latency is much longer (dozens of cycles), but OTOH the GPU doesn't wait for an LD/ST operation to finish until a later instruction actually uses the data loaded into the register.
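
Here is a minimal sketch of that latency hiding, again with made-up names: the warp only stalls at the first instruction that consumes the loaded register, so independent work issued in between overlaps with the load.

```cuda
// Illustrative only: names are made up for this sketch.
__global__ void scale(const float *in, float *out, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];     // load issued; the warp does not stall here
    float c = a * b;     // independent arithmetic overlaps with the load in flight
    out[i] = x * c;      // first use of x: only here may the warp have to wait
}
```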

And of course, the rules for shared memory are different.

Thanks so much for this great answer. It is all clear to me now.