Understanding Data Fetching Mechanics During a Load Instruction in CUDA

Hello NVIDIA Developer Community,
I am trying to deepen my understanding of how data fetching works during a load instruction in CUDA, specifically the behavior of warps and memory transactions: how data is fetched from global memory and then used by the threads within a warp.
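To make the question concrete, here is a minimal kernel I wrote just for illustration (the names, sizes, and launch configuration are my own choices, not from any particular source). Consecutive threads of a warp load consecutive floats, so the 32 loads of one warp cover one 128-byte chunk; the load of `in[i]` is the instruction my question is about:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each of the 32 threads of a warp reads one consecutive float (4 bytes),
// so together the warp touches a single, aligned 128-byte chunk of memory.
__global__ void coalescedLoad(const float* __restrict__ in,
                              float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * 2.0f;   // the load of in[i] is the instruction in question
    }
}

int main()
{
    const int n = 1 << 20;                 // illustrative problem size
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    coalescedLoad<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```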
When a load instruction is issued in CUDA, it triggers a memory transaction that fetches a consecutive chunk of data (either 32 bytes or 128 bytes, depending on the architecture) from global memory. My question is about how this fetched data is handled by the Streaming Multiprocessor (SM) and the individual threads within a warp:

  • Is the load instruction executed serially by the SM, so that the fetched chunk is delivered to the registers of every thread in the warp at once? Or does each thread, as it executes the load, issue its own request whose result is stored in the cache, so that the subsequent threads in the warp get faster loads when their data fall in the same cache line?
  • Or is the load instruction executed in parallel across the warp (which seems unrealistic to me right now)? Even then, only one thread at a time could fetch a memory line, since multiple memory transactions would be serialized (memory coalescing) — see the strided sketch after this list. So how does a load instruction actually fetch data in CUDA?
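For contrast, this is the kind of access pattern I have in mind in the second point. The stride of 32 floats is just a value I picked for illustration, but with it each thread of a warp touches a different 128-byte line, so (as I understand it) the loads cannot be combined into a single coalesced transaction:

```cpp
// Hypothetical contrast case: thread i reads in[i * 32], so the 32 addresses
// of one warp fall into 32 different 128-byte lines (stride chosen only for
// illustration). My understanding is that these loads cannot be served by a
// single coalesced memory transaction. out is assumed to hold n / 32 elements.
__global__ void stridedLoad(const float* __restrict__ in,
                            float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long idx = (long long)i * 32;   // strided address per thread
    if (idx < n) {
        out[i] = in[idx];
    }
}
```

It could be launched the same way as the first kernel, just with one thread per 32 input elements. Is the difference between these two kernels purely a matter of how many transactions the hardware issues per warp, or does the execution of the load instruction itself change?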