Using globalToShmemAsyncCopy

guilhermehartmann · August 3, 2021, 3:08pm

I am trying to use globalToShmemAsyncCopy to load tiles of a sparse matrix into shared memory so I can multiply them later, but I can’t get it to work. What are the requirements for the global-to-shared-memory copy to work?

Robert_Crovella · August 3, 2021, 3:33pm

See here, all of sections B.26 and B.27. You can see that there are multiple different API patterns you can use to tap into this capability. If you’re just getting started, try the code here.

The only requirements are a GPU of compute capability 7.0 or higher, and preferably compute capability 8.0 or higher, where the global-to-shared-memory copy is hardware-accelerated.
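For anyone who wants to check this requirement from code, here is a minimal sketch that queries the compute capability through the CUDA runtime. The function name `asyncCopySupportLevel` and its return codes are made up for illustration; only `cudaGetDeviceProperties` and the `major`/`minor` fields come from the runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): classifies a device by the
// compute capability thresholds mentioned above.
//   -1: the device could not be queried
//    0: below compute capability 7.0
//    1: 7.x, the async copy APIs are usable
//    2: 8.0 or higher, the preferred case
int asyncCopySupportLevel(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    printf("Device %d: %s, compute capability %d.%d\n",
           device, prop.name, prop.major, prop.minor);
    if (prop.major >= 8) return 2;
    if (prop.major >= 7) return 1;
    return 0;
}
```

Calling `asyncCopySupportLevel(0)` on the node where the job actually runs (for example inside the batch job or container) shows which case applies there.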

guilhermehartmann · August 3, 2021, 5:32pm

That’s interesting. I am actually using the same pattern as AsyncCopyLargeChunk from the globalToShmemAsyncCopy sample in cuda-samples. The code runs correctly, but when I profile it with Nsight Compute, the memory copy does not use the global-to-shared-memory path.
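For readers who have not seen this kind of copy, here is a minimal sketch of a block-wide asynchronous copy from global to shared memory. It is not the AsyncCopyLargeChunk code from the sample; it uses the cooperative groups `memcpy_async`/`wait` calls, and the kernel `doubleTiles`, the helper `runTileCopyCheck`, and the sizes are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each block stages one tile of `in` in shared
// memory with an asynchronous copy, then writes the doubled values.
__global__ void doubleTiles(const float* in, float* out, int n) {
    extern __shared__ float tile[];
    cg::thread_block block = cg::this_thread_block();

    int base = blockIdx.x * blockDim.x;
    int count = min(static_cast<int>(blockDim.x), n - base);
    if (count <= 0) return;  // same decision for every thread in the block

    // Block-wide copy; the size argument is in bytes.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * count);
    // Wait until the copy has completed before any thread reads the tile.
    cg::wait(block);

    int i = threadIdx.x;
    if (i < count) out[base + i] = 2.0f * tile[i];
}

// Hypothetical host driver: returns the number of mismatching elements,
// or -1 if a CUDA call failed.
int runTileCopyCheck() {
    const int n = 1000, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);

    std::vector<float> hostIn(n), hostOut(n, 0.0f);
    for (int i = 0; i < n; ++i) hostIn[i] = static_cast<float>(i);

    float *devIn = nullptr, *devOut = nullptr;
    if (cudaMalloc(&devIn, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&devOut, bytes) != cudaSuccess) { cudaFree(devIn); return -1; }

    cudaError_t status = cudaMemcpy(devIn, hostIn.data(), bytes, cudaMemcpyHostToDevice);
    if (status == cudaSuccess) {
        // Dynamic shared memory holds one tile of `threads` floats per block.
        doubleTiles<<<blocks, threads, threads * sizeof(float)>>>(devIn, devOut, n);
        status = cudaDeviceSynchronize();
    }
    if (status == cudaSuccess) {
        status = cudaMemcpy(hostOut.data(), devOut, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(devIn);
    cudaFree(devOut);
    if (status != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (hostOut[i] != 2.0f * hostIn[i]) ++mismatches;
    }
    return mismatches;
}
```

A result of 0 from `runTileCopyCheck()` only shows that the copy is functionally correct. Whether it takes the accelerated path is a separate question that has to be answered in the profiler, and it is probably worth building for the architecture of the GPU being profiled (for example `nvcc -arch=sm_80` on an A100) before comparing results.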

I initially assumed that it had something to do with the random memory access that I use later on, but I couldn’t find any documentation on it.

PS: I am currently testing on Volta and Ampere Tesla cards.

guilhermehartmann · August 4, 2021, 1:53pm

Hi Robert,

I have done some testing by profiling the cuda-samples in different branches. Apparently the problem is related to the drivers and CUDA version on our Slurm/Singularity setup.
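One way to spot this kind of mismatch is to print the CUDA version supported by the driver next to the runtime version the binary was built against. This is only a sketch: `cudaVersionsCompatible` is a made-up name, and treating "driver version at least as new as the runtime" as compatible is a simplifying assumption, not an official rule.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the driver reports a CUDA version at
// least as new as the runtime, 0 if it is older, -1 if a query failed.
int cudaVersionsCompatible() {
    int driver = 0, runtime = 0;
    if (cudaDriverGetVersion(&driver) != cudaSuccess) return -1;
    if (cudaRuntimeGetVersion(&runtime) != cudaSuccess) return -1;

    // Versions are encoded as 1000 * major + 10 * minor.
    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driver / 1000, (driver % 1000) / 10,
           runtime / 1000, (runtime % 1000) / 10);
    return driver >= runtime ? 1 : 0;
}
```

Running this inside the container on a compute node, rather than on the login node, shows the combination the job actually sees.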

Robert_Crovella · August 4, 2021, 1:55pm

Any time you are having trouble with a CUDA code, I usually recommend using proper CUDA error checking as well as running your code with compute-sanitizer. Those steps will usually turn up driver/version issues and reduce confusion.
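As a rough illustration of what error checking can look like, here is a minimal sketch. The `CUDA_CHECK` macro, the `fillValue` kernel, and `runCheckedFill` are hypothetical names; the runtime calls they wrap (`cudaGetLastError`, `cudaDeviceSynchronize`, `cudaGetErrorName`, `cudaGetErrorString`) are standard.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical macro: aborts with file and line if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d: %s\n",           \
                    cudaGetErrorName(err_), __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel used only to have something to launch.
__global__ void fillValue(int* data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Returns the number of elements that do not hold the expected value.
int runCheckedFill() {
    const int n = 1024, threads = 256;
    int* devData = nullptr;
    CUDA_CHECK(cudaMalloc(&devData, n * sizeof(int)));

    fillValue<<<(n + threads - 1) / threads, threads>>>(devData, n, 7);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // reports failures during execution

    std::vector<int> host(n, 0);
    CUDA_CHECK(cudaMemcpy(host.data(), devData, n * sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(devData));

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (host[i] != 7) ++wrong;
    }
    return wrong;
}
```

The same binary can then be run under the sanitizer as `compute-sanitizer ./your_app`, where `./your_app` stands for whatever executable you built.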

guilhermehartmann · August 4, 2021, 2:02pm

I believe the tricky part is that the code works but the optimisation doesn’t kick in, so I didn’t see any errors, yet it didn’t perform as expected. Once I got the sample working, it was trivial to transfer the logic into my own code.
