I am trying to use globalToShmemAsyncCopy to load tiles of a sparse matrix for a later multiplication, but I can't get it to work. What are the requirements for the global-to-shared-memory copy to work?
The only requirements are that you are on at least compute capability 7.0, and preferably compute capability 8.0 or higher. On 7.x devices the API is available but falls back to a synchronous copy; the hardware-accelerated asynchronous path only exists on compute capability 8.0 and above.
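To make the requirement concrete, here is a minimal sketch of the pattern the sample uses, based on cooperative_groups::memcpy_async. The kernel, tile size, and names are my own for illustration, not taken from the sample; on compute capability 8.0+ this can compile to the asynchronous global-to-shared path, while on 7.x it silently falls back to a synchronous copy.

```cuda
// Hypothetical kernel sketch: cooperatively copy one tile from global to
// shared memory, then wait for the copy before consuming the data.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

#define TILE 128  // assumed tile width; a multiple of 4 floats keeps
                  // 16-byte alignment, which the widest async copies need

__global__ void tileKernel(const float* __restrict__ A, float* out, int n)
{
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    // All threads in the block cooperate in the copy; the size is in bytes.
    cg::memcpy_async(block, tile, A + blockIdx.x * TILE,
                     sizeof(float) * TILE);

    // Wait for the async copy to land in shared memory before using it.
    cg::wait(block);

    int i = blockIdx.x * TILE + threadIdx.x;
    if (threadIdx.x < TILE && i < n)
        out[i] = tile[threadIdx.x] * 2.0f;
}
```

The same fallback behaviour is why the kernel runs correctly on Volta even though the optimisation never engages there.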
That's interesting. I am actually using the same pattern as AsyncCopyLargeChunk from the globalToShmemAsyncCopy sample in cuda-samples. The code runs correctly, but when I profile it with Nsight Compute the memory copy still does not use the global-to-shared path.
I initially assumed it had something to do with the random memory accesses I use later on, but I couldn't find any documentation on that.
PS: I am currently testing on Volta and Ampere Tesla cards.
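Besides Nsight Compute, one way to check whether the async copy path was compiled in at all is to look at the generated SASS. This is a sketch with placeholder file names (`app.cu`/`app` are assumptions); on Ampere the asynchronous global-to-shared copies show up as LDGSTS instructions:

```shell
# Build explicitly for sm_80 so the Ampere code path is actually generated;
# a binary built only for sm_70 can never contain the async instructions.
nvcc -arch=sm_80 -o app app.cu

# A non-zero count means the async global-to-shared copies were emitted.
cuobjdump -sass app | grep -c LDGSTS
```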
I have done some testing by profiling the cuda-samples on different branches. Apparently it's related to the driver and CUDA versions in our Slurm/Singularity setup.
Any time you are having trouble with a CUDA code, I usually recommend using proper CUDA error checking, as well as running your code under compute-sanitizer. Those steps will usually turn up driver/version issues and reduce confusion.
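By "proper CUDA error checking" I mean wrapping every runtime API call and checking launches explicitly. A minimal sketch (the macro name is my own, not from the sample):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check the result of a CUDA runtime call and abort with a readable
// message on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage (illustrative names):
//   CUDA_CHECK(cudaMalloc(&d_buf, bytes));
//   kernel<<<grid, block>>>(d_buf);
//   CUDA_CHECK(cudaGetLastError());      // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches async execution errors
```

Kernel launches themselves return nothing, which is why the `cudaGetLastError()` plus `cudaDeviceSynchronize()` pair after a launch matters: without it, a failed launch or a mismatched driver/runtime can pass silently.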
I believe the tricky part is that the code works but the optimisation doesn't kick in, so there were no errors to see; it just didn't perform as expected. Once I got the sample working, it was trivial to transfer the logic into my code.