Hi, I’m using cp.async.cg.shared.global [%0], [%1], 16;
to asynchronous copy data from global memory to shared memory.
I’m running the same code on Ampere and Hopper, in Ampere the code works fine, but on Hopper an error occured:
========= Invalid __shared__ write of size 16 bytes
========= at 0x13c0 in /home/zhaohs/tmp_spmm/spmm_compute/src/ptx_tf32.h:57:async_copy_idx(unsigned int, const unsigned int *)
========= by thread (1,0,0) in block (201,0,0)
========= Address 0x8c0 is out of bounds
========= Device Frame:/home/zhaohs/tmp_spmm/spmm_compute/src/mma_tf32.h:511:tf32_computeX128(const unsigned long *, const unsigned int *, const float *, const unsigned int *, const unsigned int *, const float *, float *, unsigned int, unsigned int) [0x1390]
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x32e130]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:__cudart800 [0x1b4cb]
========= in /home/zhaohs/tmp_spmm/spmm_compute/./mma_tf32
========= Host Frame:cudaLaunchKernel [0x7766b]
========= in /home/zhaohs/tmp_spmm/spmm_compute/./mma_tf32
========= Host Frame:__device_stub__Z16tf32_computeX128PKmPKjPKfS2_S2_S4_Pfjj(unsigned long const*, unsigned int const*, float const*, unsigned int const*, unsigned int const*, float const*, float*, unsigned int, unsigned int) [0xf16c]
========= in /home/zhaohs/tmp_spmm/spmm_compute/./mma_tf32
========= Host Frame:tf32_spmm(METCFBit<float>&, BME<float>&, COO<float>*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool) [0x1048c]
Does anyone know why this happens?
Any help would be so appreciated!