Thank you, it works! Mostly…
CUDA-GDB now steps into each function in the call chain one by one, but it fails at the last step, as shown below:
(cuda-gdb) break sgemm_nt_1.cu:210
Breakpoint 1 at 0xd907: file sgemm_nt_1.cu, line 222.
(cuda-gdb) run
Starting program: /home/zyhuang/temp_can/sgemm_nt_1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
cuda-gdb failed to grab the lock file /tmp/cuda-dbg/cuda-gdb.lock.
Another CUDA debug session (pid 97601) could be in progress.
Are you sure you want to continue? (y or [n]) y
[Detaching after fork from child process 98556]
[New Thread 0x7fffdffff000 (LWP 98560)]
[New Thread 0x7fffdf7fe000 (LWP 98561)]
Using device 0: NVIDIA A100 80GB PCIe (SM80, 108 SMs)
M = 5120
N = 5120
K = 4096
Verification by comparison with cuBLAS is disabled, either because the CMake option CUTLASS_ENABLE_CUBLAS was explicitly set to OFF, or because CMake could not find cuBLAS. If you would like to enable verification with cuBLAS, please set the CMake option CUTLASS_ENABLE_CUBLAS to ON, rerun CMake, and recompile this example.
ahahahaahahhahahahhhhhhhhhhhhhhh
[Switching focus to CUDA kernel 0, grid 1, block (2,7,0), thread (0,0,0), device 0, sm 66, warp 18, lane 0]
Thread 1 "sgemm_nt_1" hit Breakpoint 1, gemm_device<int, int, int, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<32>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<32> > >, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<32>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<32> > >, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<16>, cute::C<16> >, cute::tuple<cute::C<1>, cute::C<16> > >, float, float>
<<<(40,40,1),(256,1,1)>>> (M=-218129807, N=32767, K=0, A=0x7fff84000000, dA=..., blockA=..., tA=...,
B=0x7fff96000000, dB=..., blockB=..., tB=..., C=0x7fff8e000000, dC=..., tC=..., alpha=0,
beta=-1.0125765e+31) at sgemm_nt_1.cu:210
210 gemm(tCsA, tCsB, tCrC);
(cuda-gdb) s
cute::gemm<cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > > ><<<(40,40,1),(256,1,1)>>> (
A=..., B=..., C=...) at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:74
74 return gemm(C, A, B, C);
(cuda-gdb) s
cute::gemm<cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > > > (D=..., A=...,
B=..., C=...) at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:171
171 return gemm(MMA{}, D, A, B, C);
(cuda-gdb) s
cute::gemm<cute::UniversalFMA<float, float, float, float>, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, (void*)0> (mma=..., D=..., A=..., B=..., C=...)
at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:454
454 gemm(mma,
(cuda-gdb) s
455 make_tensor(D.data(), prepend<3>(D.layout())), // (1,M,N)
(cuda-gdb)
As you can see, the last call is the gemm(mma, … on line 454. From there, stepping should take me into the function you pointed me to, but instead it only advances to line 455. Why doesn't it step in? (Could this be a bug in cuda-gdb?)
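One possible reading of the trace, which I have not confirmed: line 455 is an argument expression of the call that starts on line 454. C++ evaluates all arguments before entering the callee, so a debugger can stop on the argument lines first, and the callee may only be reached after a few more steps. The sketch below is plain host C++ with hypothetical names (make_arg, callee, dispatch); it is not CuTe code and only illustrates that ordering.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins, NOT CuTe code: they only record the order in
// which a multi-line call expression is executed.
static std::vector<std::string> trace;

// Plays the role of an argument expression such as make_tensor(...).
int make_arg(int x, const char* label) {
    trace.push_back(label);
    return x;
}

// Plays the role of the final function we want to step into.
int callee(int a, int b, int c) {
    trace.push_back("callee body");
    return a + b + c;
}

// A call spread over several source lines, like the one at gemm.hpp:454.
// Every argument is evaluated before callee() is entered. The relative
// order of the three arguments is unspecified in C++, but the callee body
// always runs last.
int dispatch() {
    return callee(make_arg(1, "arg D"),
                  make_arg(2, "arg A"),
                  make_arg(3, "arg B"));
}
```

After calling dispatch(), trace holds the three argument labels in some order followed by "callee body". If the same thing is happening here, a few more s commands from line 455 might eventually land inside the target function, though I am only assuming that.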