[H100 | SM90a] Illegal Instruction in TMA Load (tma_copy_3d) with CUDA 12.9

Device: H100
Compiler: nvcc 12.9
Architecture: sm_90a
Kernel Launch Configuration:
Grid: (2, 66, 1)
Block: (384, 1, 1)
Cluster: (2, 1, 1)
Shared Memory: 94592 bytes

Problem:
When running my SpMM kernel with compute-sanitizer, I encountered multiple Illegal Instruction errors.

Error Summary:

  1. Illegal instruction occurs in tma_copy_3d.
========= ========= Illegal instruction
=========     at tma_copy_3d+0x14e90 in h_SpMM_Kernel.cuh:337
=========     by thread (0,0,0) in block (0,4,0)
=========         Device Frame: load+0x14ec0 in h_SpMM_Kernel.cuh:1462
=========         Device Frame: void Kernel<h_spinfer::HTilingConfig<(int)4, (int)1, (int)1, (int)1>, __half, __half, float, float>(const __half *, const __half *, const int *, const int *, const int *, const unsigned long *, const int *, const __half *, __half *, __half *, int, int, int, int, CUtensorMap_st)+0x165f0 in h_SpMM_Kernel.cuh:1902
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cudaLaunchKernelExC [0x83103] in spmm_test
=========         Host Frame: void SpMM_SplitK_Kernel_Ex_bitmap_v4<h_spinfer::HTilingConfig<4, 1, 1, 1> >(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, __half*, int, int, int, int) in h_SpMM_API.cu:120 [0x13f81] in spmm_test
=========         Host Frame: SpMM_SplitK_API_bitmap_v4(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, int, int, int, __half*, int) in h_SpMM_API.cu:158 [0x132b5] in spmm_test
=========         Host Frame: main in spmm_test.cu:576 [0xc3e7] in spmm_test
========= 
  • This corresponds to the tma_copy_3d function in my code.

  1. Illegal instruction at __cvta_generic_to_shared, traced to cast_smem_ptr_to_uint at h_SpMM_Kernel.cuh:330, also inside tma_copy_3d.
========= Illegal instruction
=========     at cast_smem_ptr_to_uint+0x14c50 in common.h:403
=========     by thread (0,0,0) in block (0,5,0)
=========         Device Frame: tma_copy_3d+0x14ca0 in h_SpMM_Kernel.cuh:330
=========         Device Frame: load+0x14ec0 in h_SpMM_Kernel.cuh:1462
=========         Device Frame: void Kernel<h_spinfer::HTilingConfig<(int)4, (int)1, (int)1, (int)1>, __half, __half, float, float>(const __half *, const __half *, const int *, const int *, const int *, const unsigned long *, const int *, const __half *, __half *, __half *, int, int, int, int, CUtensorMap_st)+0x165f0 in h_SpMM_Kernel.cuh:1902
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cudaLaunchKernelExC [0x83103] in spmm_test
=========         Host Frame: void SpMM_SplitK_Kernel_Ex_bitmap_v4<h_spinfer::HTilingConfig<4, 1, 1, 1> >(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, __half*, int, int, int, int) in h_SpMM_API.cu:120 [0x13f81] in spmm_test
=========         Host Frame: SpMM_SplitK_API_bitmap_v4(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, int, int, int, __half*, int) in h_SpMM_API.cu:158 [0x132b5] in spmm_test
=========         Host Frame: main in spmm_test.cu:576 [0xc3e7] in spmm_test

  1. Another illegal instruction is triggered after copy_tma_3d, specifically at h_SpMM_Kernel.cuh:1499.
========= Illegal instruction
=========     at load+0x14ef0 in h_SpMM_Kernel.cuh:1499
=========     by thread (0,0,0) in block (0,6,0)
=========         Device Frame: void Kernel<h_spinfer::HTilingConfig<(int)4, (int)1, (int)1, (int)1>, __half, __half, float, float>(const __half *, const __half *, const int *, const int *, const int *, const unsigned long *, const int *, const __half *, __half *, __half *, int, int, int, int, CUtensorMap_st)+0x165f0 in h_SpMM_Kernel.cuh:1902
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cudaLaunchKernelExC [0x83103] in spmm_test
=========         Host Frame: void SpMM_SplitK_Kernel_Ex_bitmap_v4<h_spinfer::HTilingConfig<4, 1, 1, 1> >(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, __half*, int, int, int, int) in h_SpMM_API.cu:120 [0x13f81] in spmm_test
=========         Host Frame: SpMM_SplitK_API_bitmap_v4(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, int, int, int, __half*, int) in h_SpMM_API.cu:158 [0x132b5] in spmm_test
=========         Host Frame: main in spmm_test.cu:576 [0xc3e7] in spmm_test
  • According to the log, the crash happens at h_SpMM_Kernel.cuh:1499. If I uncomment lines 1474–1497, then the crash moves to line 1474, still showing an illegal instruction.

I believe the tma_copy_3d function itself should be correct, as I based it on this template:
👉 MatmulTutorial/examples/matmul/this-sm90/matmul-pingpong-v1.cu at main · KnowingNothing/MatmulTutorial · GitHub

I suspect the issue could be related to an invalid operand passed to a low-level instruction, but I’m not sure how to verify this.

Build Configuration:

# Environment
CUDA_PATH = /data/home/tester/zkg/local/cuda-12.9/
nvcc = $(CUDA_PATH)/bin/nvcc
SM = 90a

# Compilation flags
NVCCFLAGS += --std=c++17 --threads 0 --use_fast_math -maxrregcount=255
NVCCFLAGS += --ptxas-options=-v,-warn-lmem-usage,--warn-on-spills
NVCCFLAGS += -g -G -lineinfo
GENCODE_FLAGS += -gencode arch=compute_$(SM),code=sm_$(SM)

# Linking libraries
LIBRARIES += -lcuda -lcublas -lcusparse
# (Optional) -lcusparseLt if needed