Understanding Data Fetching Mechanics During a Load Instruction in CUDA

Hello NVIDIA Developer Community,
I am trying to deepen my understanding of how data fetching works during a load instruction in CUDA, specifically the behavior of warps and memory transactions: how data is fetched from global memory and then used by the threads within a warp.
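To make the question concrete, here is a minimal kernel I wrote just for illustration (the names, sizes, and launch configuration are my own choices, not from any particular source). Consecutive threads of a warp load consecutive floats, so the 32 loads of one warp cover one 128-byte chunk; the load of `in[i]` is the instruction my question is about:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each of the 32 threads of a warp reads one consecutive float (4 bytes),
// so together the warp touches a single, aligned 128-byte chunk of memory.
__global__ void coalescedLoad(const float* __restrict__ in,
                              float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * 2.0f;   // the load of in[i] is the instruction in question
    }
}

int main()
{
    const int n = 1 << 20;                 // illustrative problem size
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    coalescedLoad<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```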
When a load instruction is issued in CUDA, it triggers a memory transaction that fetches a consecutive chunk of data (either 32 bytes or 128 bytes, depending on the architecture) from global memory. My question is about how this fetched data is handled by the Streaming Multiprocessor (SM) and the individual threads within a warp:

  • Is the load instruction executed serially by the SM, so that the fetched chunk is delivered to the registers of every thread in the warp at once? Or does each thread, as it executes the load, issue its own request whose result is stored in the cache, so that the subsequent threads in the warp get faster loads when their data fall in the same cache line?
  • Or is the load instruction executed in parallel across the warp (which seems unrealistic to me right now)? Even then, only one thread at a time could fetch a memory line, since multiple memory transactions would be serialized (memory coalescing) — see the strided sketch after this list. So how does a load instruction actually fetch data in CUDA?
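For contrast, this is the kind of access pattern I have in mind in the second point. The stride of 32 floats is just a value I picked for illustration, but with it each thread of a warp touches a different 128-byte line, so (as I understand it) the loads cannot be combined into a single coalesced transaction:

```cpp
// Hypothetical contrast case: thread i reads in[i * 32], so the 32 addresses
// of one warp fall into 32 different 128-byte lines (stride chosen only for
// illustration). My understanding is that these loads cannot be served by a
// single coalesced memory transaction. out is assumed to hold n / 32 elements.
__global__ void stridedLoad(const float* __restrict__ in,
                            float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long idx = (long long)i * 32;   // strided address per thread
    if (idx < n) {
        out[i] = in[idx];
    }
}
```

It could be launched the same way as the first kernel, just with one thread per 32 input elements. Is the difference between these two kernels purely a matter of how many transactions the hardware issues per warp, or does the execution of the load instruction itself change?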