Using globalToShmemAsyncCopy

I am trying to make use of globalToShmemAsyncCopy to load tiles of a sparse matrix for a later multiply, but I can't get it to work. What are the requirements for the global-to-shared-memory async copy to work?

See here, all of sections B.26 and B.27. You can see that there are multiple different API patterns you can use to tap into this capability. If you’re just getting started, try the code here.

The only requirements are that you are using a GPU of at least compute capability 7.0, and preferably compute capability 8.0 or higher.
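For reference, one of the API patterns from those sections is the pipeline primitives. Here is a minimal sketch (not the sample's exact code; the kernel name, the 256-thread block size, and the float4 tile are illustrative assumptions) of an async tile load from global to shared memory:

```cuda
#include <cuda_pipeline.h>

// Hypothetical tile-loading kernel: each thread async-copies one float4
// (16 bytes) from global to shared memory. The 16-byte copy size matters:
// on compute capability 8.0+ it can take the hardware-accelerated path
// that bypasses registers; smaller sizes fall back to an ordinary copy.
__global__ void loadTileAsync(const float4 *src, float4 *dst, int n) {
    __shared__ float4 tile[256];  // assumes blockDim.x == 256
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        __pipeline_memcpy_async(&tile[threadIdx.x], &src[idx], sizeof(float4));
    }
    __pipeline_commit();       // commit this batch of async copies
    __pipeline_wait_prior(0);  // wait until all committed copies have landed
    __syncthreads();
    // ... operate on tile[] here, then write results back out ...
    if (idx < n) dst[idx] = tile[threadIdx.x];
}
```

This must be compiled for at least `-arch=sm_70`; on compute capability 7.x the primitives work but are emulated with regular loads and stores.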

That's interesting. I am actually using the same pattern as AsyncCopyLargeChunk from the globalToShmemAsyncCopy cuda-sample. The code runs correctly, but when I profile it with Nsight Compute the memory copy does not use the global-to-shared path.

I initially assumed it had something to do with the random memory access pattern I use later on, but I couldn't find any documentation about it.

PS: I am currently testing on Volta and Ampere Tesla cards.
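One hedged way to check whether the async path was compiled in at all (an assumption worth verifying, since on Volta/sm_70 the async API falls back to ordinary load/store pairs, while on Ampere/sm_80 it compiles to the LDGSTS instruction) is to inspect the SASS of the binary:

```shell
# Count LDGSTS (the Ampere async global-to-shared instruction) in the
# embedded sm_80 SASS; ./my_app is a placeholder for your binary.
cuobjdump -sass ./my_app | grep -c LDGSTS
```

A count of zero for the sm_80 cubin suggests the compiler never emitted the async copy, regardless of what the source code requests.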

Hi Robert,

I have done some testing by profiling the cuda-samples on different branches. Apparently it's something related to the driver and CUDA version on our Slurm/Singularity setup.

Any time you are having trouble with a CUDA code, I recommend using proper CUDA error checking as well as running your code with compute-sanitizer. Those steps will usually turn up driver/version issues and reduce confusion.
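"Proper CUDA error checking" usually means wrapping every runtime API call and checking kernel launches explicitly. A minimal sketch (the macro name is an arbitrary convention, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; print the error string and location on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_buf, bytes));
//   myKernel<<<grid, block>>>(d_buf);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches async execution errors
```

Then run the binary under `compute-sanitizer ./my_app` to catch memory and API errors the checks alone would miss.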

I believe the tricky part is that the code works but the optimisation doesn't kick in, so I didn't see any errors; it just didn't perform as expected. Once I got the sample working, it was trivial to transfer the logic over.