Device: H100
Compiler: nvcc 12.9
Architecture: sm_90a
Kernel Launch Configuration:
Grid: (2, 66, 1)
Block: (384, 1, 1)
Cluster: (2, 1, 1)
Shared Memory: 94592 bytes
Problem:
When running my SpMM kernel with compute-sanitizer, I encountered multiple Illegal Instruction errors.
Error Summary:
- Illegal instruction occurs in
tma_copy_3d.
========= ========= Illegal instruction
========= at tma_copy_3d+0x14e90 in h_SpMM_Kernel.cuh:337
========= by thread (0,0,0) in block (0,4,0)
========= Device Frame: load+0x14ec0 in h_SpMM_Kernel.cuh:1462
========= Device Frame: void Kernel<h_spinfer::HTilingConfig<(int)4, (int)1, (int)1, (int)1>, __half, __half, float, float>(const __half *, const __half *, const int *, const int *, const int *, const unsigned long *, const int *, const __half *, __half *, __half *, int, int, int, int, CUtensorMap_st)+0x165f0 in h_SpMM_Kernel.cuh:1902
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: cudaLaunchKernelExC [0x83103] in spmm_test
========= Host Frame: void SpMM_SplitK_Kernel_Ex_bitmap_v4<h_spinfer::HTilingConfig<4, 1, 1, 1> >(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, __half*, int, int, int, int) in h_SpMM_API.cu:120 [0x13f81] in spmm_test
========= Host Frame: SpMM_SplitK_API_bitmap_v4(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, int, int, int, __half*, int) in h_SpMM_API.cu:158 [0x132b5] in spmm_test
========= Host Frame: main in spmm_test.cu:576 [0xc3e7] in spmm_test
=========
- This corresponds to the
tma_copy_3dfunction in my code.
- Illegal instruction at
__cvta_generic_to_shared, traced tocast_smem_ptr_to_uintath_SpMM_Kernel.cuh:330, also insidetma_copy_3d.
========= Illegal instruction
========= at cast_smem_ptr_to_uint+0x14c50 in common.h:403
========= by thread (0,0,0) in block (0,5,0)
========= Device Frame: tma_copy_3d+0x14ca0 in h_SpMM_Kernel.cuh:330
========= Device Frame: load+0x14ec0 in h_SpMM_Kernel.cuh:1462
========= Device Frame: void Kernel<h_spinfer::HTilingConfig<(int)4, (int)1, (int)1, (int)1>, __half, __half, float, float>(const __half *, const __half *, const int *, const int *, const int *, const unsigned long *, const int *, const __half *, __half *, __half *, int, int, int, int, CUtensorMap_st)+0x165f0 in h_SpMM_Kernel.cuh:1902
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: cudaLaunchKernelExC [0x83103] in spmm_test
========= Host Frame: void SpMM_SplitK_Kernel_Ex_bitmap_v4<h_spinfer::HTilingConfig<4, 1, 1, 1> >(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, __half*, int, int, int, int) in h_SpMM_API.cu:120 [0x13f81] in spmm_test
========= Host Frame: SpMM_SplitK_API_bitmap_v4(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, int, int, int, __half*, int) in h_SpMM_API.cu:158 [0x132b5] in spmm_test
========= Host Frame: main in spmm_test.cu:576 [0xc3e7] in spmm_test
- Another illegal instruction is triggered after
copy_tma_3d, specifically ath_SpMM_Kernel.cuh:1499.
========= Illegal instruction
========= at load+0x14ef0 in h_SpMM_Kernel.cuh:1499
========= by thread (0,0,0) in block (0,6,0)
========= Device Frame: void Kernel<h_spinfer::HTilingConfig<(int)4, (int)1, (int)1, (int)1>, __half, __half, float, float>(const __half *, const __half *, const int *, const int *, const int *, const unsigned long *, const int *, const __half *, __half *, __half *, int, int, int, int, CUtensorMap_st)+0x165f0 in h_SpMM_Kernel.cuh:1902
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: cudaLaunchKernelExC [0x83103] in spmm_test
========= Host Frame: void SpMM_SplitK_Kernel_Ex_bitmap_v4<h_spinfer::HTilingConfig<4, 1, 1, 1> >(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, __half*, int, int, int, int) in h_SpMM_API.cu:120 [0x13f81] in spmm_test
========= Host Frame: SpMM_SplitK_API_bitmap_v4(CUstream_st*, __half const*, __half const*, int const*, int const*, int const*, unsigned long const*, int const*, __half const*, __half*, int, int, int, __half*, int) in h_SpMM_API.cu:158 [0x132b5] in spmm_test
========= Host Frame: main in spmm_test.cu:576 [0xc3e7] in spmm_test
- According to the log, the crash happens at
h_SpMM_Kernel.cuh:1499. If I uncomment lines 1474–1497, then the crash moves to line 1474, still showing an illegal instruction.
I believe the tma_copy_3d function itself should be correct, as I based it on this template:
👉 MatmulTutorial/examples/matmul/this-sm90/matmul-pingpong-v1.cu at main · KnowingNothing/MatmulTutorial · GitHub
I suspect the issue could be related to an invalid operand passed to a low-level instruction, but I’m not sure how to verify this.
Build Configuration:
# Environment
CUDA_PATH = /data/home/tester/zkg/local/cuda-12.9/
nvcc = $(CUDA_PATH)/bin/nvcc
SM = 90a
# Compilation flags
NVCCFLAGS += --std=c++17 --threads 0 --use_fast_math -maxrregcount=255
NVCCFLAGS += --ptxas-options=-v,-warn-lmem-usage,--warn-on-spills
NVCCFLAGS += -g -G -lineinfo
GENCODE_FLAGS += -gencode arch=compute_$(SM),code=sm_$(SM)
# Linking libraries
LIBRARIES += -lcuda -lcublas -lcusparse
# (Optional) -lcusparseLt if needed


