CUDA PTX cp.async.cg performs differently on Ampere and Hopper

hyaloids · July 4, 2024, 7:16am

Hi, I’m using cp.async.cg.shared.global [%0], [%1], 16; to asynchronously copy data from global memory to shared memory.
I’m running the same code on Ampere and Hopper. On Ampere the code works fine, but on Hopper an error occurred:

========= Invalid __shared__ write of size 16 bytes
=========     at 0x13c0 in /home/zhaohs/tmp_spmm/spmm_compute/src/ptx_tf32.h:57:async_copy_idx(unsigned int, const unsigned int *)
=========     by thread (1,0,0) in block (201,0,0)
=========     Address 0x8c0 is out of bounds
=========     Device Frame:/home/zhaohs/tmp_spmm/spmm_compute/src/mma_tf32.h:511:tf32_computeX128(const unsigned long *, const unsigned int *, const float *, const unsigned int *, const unsigned int *, const float *, float *, unsigned int, unsigned int) [0x1390]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e130]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart800 [0x1b4cb]
=========                in /home/zhaohs/tmp_spmm/spmm_compute/./mma_tf32
=========     Host Frame:cudaLaunchKernel [0x7766b]
=========                in /home/zhaohs/tmp_spmm/spmm_compute/./mma_tf32
=========     Host Frame:__device_stub__Z16tf32_computeX128PKmPKjPKfS2_S2_S4_Pfjj(unsigned long const*, unsigned int const*, float const*, unsigned int const*, unsigned int const*, float const*, float*, unsigned int, unsigned int) [0xf16c]
=========                in /home/zhaohs/tmp_spmm/spmm_compute/./mma_tf32
=========     Host Frame:tf32_spmm(METCFBit<float>&, BME<float>&, COO<float>*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool) [0x1048c]

Does anyone know why this happens?
Any help would be so appreciated!

striker159 · July 4, 2024, 9:07am

This error typically means there are index calculation errors or the shared memory size is calculated incorrectly.
Without a minimal reproducer, it will be hard to find out the exact cause.

Curefab · July 4, 2024, 9:14am

That is the relevant error.

hyaloids · July 4, 2024, 10:26am

Hi, I have posted a minimal reproducer in the reply. Could you please take a look at it? Thanks a lot!

Curefab · July 4, 2024, 10:31am

Is this what you intended? << 2 multiplies by 4 and << 4 multiplies by 16, so together you multiply by 64.
Your shared memory array has a size of only 16.

You also have a problem with the memory you are copying from. Both buffers are only large enough for the first thread (tid == 0), not the second one (tid == 1).

hyaloids · July 4, 2024, 10:34am

I got it, thanks! But BTW, why does it run correctly on Ampere?

Curefab · July 4, 2024, 10:36am

Perhaps Ampere allocates more memory (rounded up) or has coarser memory protection mechanisms? I am not sure why it worked, especially with the sanitizer.

Perhaps somebody else can point out which sanitizer settings would have caught this.

hyaloids · July 4, 2024, 10:40am

Got it! Thank you again!

