Hello, there’s an example in §7.29.1 (the section on TMA) of the CUDA C++ manual which I compiled, and the results seemed strange to me. Here’s a gist I made with the source and the PTX. (I had to adjust a couple of details because my machine doesn’t have the `<cuda/ptx>` header.)
The short version is that the generated PTX does not use `cp.async.bulk` at all for the gmem ~> smem (global to shared memory) direction, but it does use it in the other direction.
My compilation command was `nvcc tma_example.cu -ptx -src-in-ptx -arch=sm_90`. My version info is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
If I put the same source into Godbolt (using nvcc 12.4.1), it does what I expect: the copies in both directions use `cp.async.bulk`.
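In case it helps anyone compare the two outputs without reading the PTX by eye, here is a minimal sketch of a helper that counts bulk-copy instructions per direction in a PTX listing. The names `BulkCopyCounts` and `count_bulk_copies` are made up for illustration. It assumes the non-tensor instruction spellings `cp.async.bulk.shared::cluster.global...` (gmem to smem) and `cp.async.bulk.global.shared::cta...` (smem to gmem), and it skips `//` comment lines because `-src-in-ptx` interleaves the CUDA source as comments.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper type: counts of bulk-copy instructions by direction.
struct BulkCopyCounts {
    int global_to_shared = 0;  // cp.async.bulk.shared::cluster.global...
    int shared_to_global = 0;  // cp.async.bulk.global.shared::cta...
};

// Scan PTX text line by line and classify cp.async.bulk data-movement
// instructions. Assumption: only the non-tensor spellings above are matched;
// cp.async.bulk.commit_group / wait_group and tensor variants are ignored.
BulkCopyCounts count_bulk_copies(const std::string& ptx) {
    BulkCopyCounts counts;
    std::istringstream in(ptx);
    std::string line;
    while (std::getline(in, line)) {
        std::size_t first = line.find_first_not_of(" \t");
        if (first == std::string::npos) continue;         // blank line
        if (line.compare(first, 2, "//") == 0) continue;  // source/comment line
        if (line.find("cp.async.bulk.shared::cluster.global") != std::string::npos) {
            ++counts.global_to_shared;
        } else if (line.find("cp.async.bulk.global.shared::cta") != std::string::npos) {
            ++counts.shared_to_global;
        }
    }
    return counts;
}
```

To use it, read the generated `tma_example.ptx` into a `std::string` and pass it to `count_bulk_copies`. A `global_to_shared` count of zero alongside a nonzero `shared_to_global` count would correspond to what I see with nvcc 12.1.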