Strange PTX output, removes TMA

Hello, there’s an example in §7.29.1 (the section on TMA) of the CUDA C++ manual which I compiled, and the results seemed strange to me. Here’s a gist I made with the source and the PTX. (I had to adjust a couple of details because my machine doesn’t have the <cuda/ptx> header.)

The short version is that the generated PTX does not use cp.async.bulk at all for the global-to-shared (gmem -> smem) copy, but it does use it for the shared-to-global copy.
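For anyone who wants to check their own PTX for the same thing, here is a rough Python sketch that counts cp.async.bulk instructions by direction. The function name bulk_copy_directions is made up for illustration, and the state-space spellings it looks for (.shared::cluster.global for gmem -> smem, .global.shared::cta for smem -> gmem) are my assumption of how the non-tensor bulk copy appears in PTX; adjust the pattern if your output differs. It drops // comments first, since -src-in-ptx puts the source lines into the PTX as comments.

```python
import re

# Assumed spelling of the non-tensor bulk copy: cp.async.bulk.<dst space>.<src space>...
BULK_COPY = re.compile(
    r"cp\.async\.bulk\.(shared::(?:cluster|cta)|global)\.(global|shared::(?:cluster|cta))"
)

def bulk_copy_directions(ptx_text):
    """Count cp.async.bulk instructions per copy direction in a PTX listing."""
    counts = {"global_to_shared": 0, "shared_to_global": 0}
    for line in ptx_text.splitlines():
        # Drop comments so source lines embedded by -src-in-ptx are not counted.
        code = line.split("//", 1)[0]
        match = BULK_COPY.search(code)
        if match is None:
            continue
        dst, src = match.groups()
        if dst.startswith("shared") and src == "global":
            counts["global_to_shared"] += 1
        elif dst == "global" and src.startswith("shared"):
            counts["shared_to_global"] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage: python check_ptx.py tma_example.ptx
    with open(sys.argv[1]) as f:
        print(bulk_copy_directions(f.read()))
```

A global_to_shared count of 0 next to a nonzero shared_to_global count would correspond to what I am seeing with 12.1.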

My compilation command was nvcc tma_example.cu -ptx -src-in-ptx -arch=sm_90. My version info is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

If I put the same source into Godbolt (using nvcc 12.4.1), it does what I expect and the copies in both directions use cp.async.bulk.

It seems there was a change between CUDA 12.1 and 12.4 in this respect. If it’s of concern to you, you should move forward to CUDA 12.4. Here is godbolt on CUDA 12.1. (I also note that the 12.1 version of the programming guide, which corresponds to your 12.1 nvcc, doesn’t seem to have the same section describing TMA; the 12.1 programming guide doesn’t say much at all about TMA.)
