Data load question

half-0 · December 18, 2024, 2:01am

If I need to load array A from global memory into each thread's registers, will it be faster to load it into shared memory first and then into registers? (Assume there are no bank conflicts and all data is aligned.)
Here is my case: I have 4 warps per CTA, and each warp needs N bytes from array A (in gmem). Warps 0 and 1 need the same data, and warps 2 and 3 need the same data. Would it be faster to load A into smem before loading it into registers in this specific case? What about the general case?

Curefab · December 18, 2024, 12:17pm

What do you mean by "load to shared memory"? With asynchronous memory copy operations, or by having the threads load the data and then store it to shared memory?

If you need the same bytes in different lanes, they only have to be loaded once, so that is efficient.
But each access should load at least aligned 32-byte blocks (for coalescing), regardless of which threads need the data.
Sometimes it helps to use 8-byte or 16-byte accesses per thread to achieve that.
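
A minimal sketch of what a 16-byte access per thread could look like. The kernel name `load_vec16`, the `out` buffer, and the use of `uint4` as the 16-byte type are assumptions for illustration, not something from the original question. It also assumes A is 16-byte aligned (pointers returned by `cudaMalloc` are) and that its size is a multiple of 16 bytes.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread reads 16 bytes of A with a single
// vectorized load, so one warp covers 512 contiguous, aligned bytes.
// Assumes A is 16-byte aligned and n_vec counts uint4 elements.
__global__ void load_vec16(const uint4* __restrict__ A,
                           uint4* __restrict__ out,
                           int n_vec)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec) {
        uint4 v = A[i];  // one 16-byte global load into registers
        out[i] = v;      // stand-in for real work on the register values
    }
}
```

Something like `load_vec16<<<(n_vec + 127) / 128, 128>>>(dA, dOut, n_vec);` would launch it with 4 warps per CTA. Whether the wider access actually helps depends on the access pattern, so it is worth checking with a profiler.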

half-0 · December 18, 2024, 3:01pm

Not async. I will try loading to smem and then to registers. Thanks!

Curefab · December 18, 2024, 3:13pm

I made a mistake.
You wrote about warps needing the same data, not threads.

In that case, you should either exchange the data through shared memory or trust the L1 cache.
For the shared memory route, let only the even warps load the data from global memory and write it to shared memory.
Afterwards, let only the odd warps load the data from shared memory.

For synchronization, you can either use a block-wide sync (`__syncthreads()`) or use inline PTX for numbered barriers that synchronize the warps pairwise. The second option is slightly faster, as not all warps are blocked.
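
A rough sketch of the even/odd warp exchange described above, with both synchronization options selectable through a template flag. The kernel name, the `pair_barrier` helper, the `out` buffer, and the simplification that each warp needs exactly one `float` per lane (N = 128 bytes) are assumptions for illustration. Barrier IDs 1 and 2 are used for the pairs because barrier 0 is the one `__syncthreads()` uses.

```cuda
#include <cuda_runtime.h>

constexpr int WARP = 32;
constexpr int WARPS_PER_CTA = 4;
constexpr int PAIRS = WARPS_PER_CTA / 2;

// Hypothetical helper: numbered barrier for the 64 threads of one warp pair.
// Barrier 0 belongs to __syncthreads(), so the pairs use IDs 1 and 2.
__device__ __forceinline__ void pair_barrier(int pair)
{
    asm volatile("bar.sync %0, %1;" :: "r"(pair + 1), "r"(2 * WARP) : "memory");
}

// Assumes blockDim.x == 128 and one float per lane per warp pair.
template <bool UseNamedBarrier>
__global__ void load_shared_between_warp_pairs(const float* __restrict__ A,
                                               float* __restrict__ out)
{
    __shared__ float smem[PAIRS][WARP];

    int warp = threadIdx.x / WARP;
    int lane = threadIdx.x % WARP;
    int pair = warp / 2;  // warps 0,1 -> pair 0; warps 2,3 -> pair 1

    float v = 0.0f;
    if ((warp & 1) == 0) {
        // Even warp: load from global memory, keep the value in a register,
        // and publish it to shared memory for the partner warp.
        v = A[(blockIdx.x * PAIRS + pair) * WARP + lane];
        smem[pair][lane] = v;
    }

    // Every thread reaches the barrier, outside of any divergent branch.
    if (UseNamedBarrier) {
        pair_barrier(pair);   // blocks only the two warps of this pair
    } else {
        __syncthreads();      // blocks all four warps
    }

    if (warp & 1) {
        // Odd warp: read the partner's data from shared memory.
        v = smem[pair][lane];
    }

    out[blockIdx.x * blockDim.x + threadIdx.x] = v;  // stand-in for real work
}
```

Launching `load_shared_between_warp_pairs<true><<<1, 128>>>(dA, dOut);` uses the pairwise barriers, and `<false>` uses the block-wide sync. Whether either variant beats simply letting all four warps load from global memory and relying on the L1 cache would need to be measured.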
