CUDA PTX cp.async.cg performs differently on Ampere and Hopper

Hi, I’m using cp.async.cg.shared.global [%0], [%1], 16; to asynchronous copy data from global memory to shared memory.
I’m running the same code on Ampere and Hopper, in Ampere the code works fine, but on Hopper an error occured:

========= Invalid __shared__ write of size 16 bytes
=========     at 0x13c0 in /home/zhaohs/tmp_spmm/spmm_compute/src/ptx_tf32.h:57:async_copy_idx(unsigned int, const unsigned int *)
=========     by thread (1,0,0) in block (201,0,0)
=========     Address 0x8c0 is out of bounds
=========     Device Frame:/home/zhaohs/tmp_spmm/spmm_compute/src/mma_tf32.h:511:tf32_computeX128(const unsigned long *, const unsigned int *, const float *, const unsigned int *, const unsigned int *, const float *, float *, unsigned int, unsigned int) [0x1390]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e130]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart800 [0x1b4cb]
=========                in /home/zhaohs/tmp_spmm/spmm_compute/./mma_tf32
=========     Host Frame:cudaLaunchKernel [0x7766b]
=========                in /home/zhaohs/tmp_spmm/spmm_compute/./mma_tf32
=========     Host Frame:__device_stub__Z16tf32_computeX128PKmPKjPKfS2_S2_S4_Pfjj(unsigned long const*, unsigned int const*, float const*, unsigned int const*, unsigned int const*, float const*, float*, unsigned int, unsigned int) [0xf16c]
=========                in /home/zhaohs/tmp_spmm/spmm_compute/./mma_tf32
=========     Host Frame:tf32_spmm(METCFBit<float>&, BME<float>&, COO<float>*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool) [0x1048c]

Does anyone know why this happens?
Any help would be so appreciated!

This error typically means there are index calculation errors or the shared memory size is calculated incorrectly.
Without a minimal reproducer, it will be hard to find out the exact cause.

That is the relevant error.

Hi, I give a minimal reproducer in the reply, could you please help and take a look at it. Thanks a lot!

Is this, what you intended? << 2 multiplies by 4, << 4 multiplies by 16; together you multiply by 64.
Your shared memory array size is 16 only.

You also have a problem with the memory you are copying from. Both only are large enough for the first thread (tid == 0), not the second one (tid == 1).

1 Like

I got it, thanks! But BTW, why does it run correct in Ampere?

Perhaps Ampere allocates more memory (rounded up) or has coarser memory protection mechanisms? Not sure, why it worked. Especially with the sanitizer.

Perhaps somebody else can point out, which sanitizer settings would have caught this.

1 Like

Got it! Thank you again!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.