Why does cuda-gdb not step into the function I expect?

I am building and running this code:

/temp_can$ nvcc -o sgemm_nt_1 sgemm_nt_1.cu -arch=sm_80 -std=c++17 -I ../cutlass/include -I ../cutlass/tools/util/include --expt-relaxed-constexpr -g -G
[1]+  Killed                  cuda-gdb sgemm_nt_1

:~/temp_can$ 
:~/temp_can$ cuda-gdb sgemm_nt_1
NVIDIA (R) CUDA Debugger
11.7 release
Portions Copyright (C) 2007-2022 NVIDIA Corporation
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from sgemm_nt_1...
(cuda-gdb) break sgemm_nt_1.cu:202
Breakpoint 1 at 0xdb6a: file sgemm_nt_1.cu, line 238.
(cuda-gdb) run
Starting program: /home/zyhuang/temp_can/sgemm_nt_1 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 115566]
[New Thread 0x7fffdffff000 (LWP 115570)]
[New Thread 0x7fffdf7fe000 (LWP 115571)]
Using device 0: NVIDIA A100 80GB PCIe  (SM80, 108 SMs)
M = 5120
N = 5120
K = 4096
Verification by comparison with cuBLAS is disabled, either because the CMake option CUTLASS_ENABLE_CUBLAS was explicitly set to OFF, or because CMake could not find cuBLAS.  If you would like to enable verification with cuBLAS, please set the CMake option CUTLASS_ENABLE_CUBLAS to ON, rerun CMake, and recompile this example.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (224,0,0), device 0, sm 0, warp 7, lane 0]

Thread 1 "sgemm_nt_1" hit Breakpoint 1, gemm_device<int, int, int, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<32>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<32> > >, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<32>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<32> > >, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<16>, cute::C<16> >, cute::tuple<cute::C<1>, cute::C<16> > >, float, float>
   <<<(40,40,1),(256,1,1)>>> (M=5120, N=5120, K=4096, A=0x7fff84000000, dA=..., blockA=..., tA=..., 
    B=0x7fff96000000, dB=..., blockB=..., tB=..., C=0x7fff8e000000, dC=..., tC=..., alpha=1, beta=0)
    at sgemm_nt_1.cu:202
202         copy(tAgA(_,_,k), tAsA);
(cuda-gdb) s
cute::Tensor<cute::ViewEngine<cute::gmem_ptr<float const*> >, cute::Layout<cute::tuple<cute::C<4>, cute::C<1>, int>, cute::tuple<cute::C<32>, cute::C<0>, int> > >::operator()<cute::Underscore, cute::Underscore, int>
   <<<(40,40,1),(256,1,1)>>> (this=0x7ffff2fffc30, c0=..., c1=..., cs=<error reading variable>)
    at /home/zyhuang/temp_can/../cutlass/include/cute/tensor.hpp:239
239         return operator()(make_coord(c0,c1,cs...));
(cuda-gdb) 

The code here is:

template <class SrcEngine, class SrcLayout,
          class DstEngine, class DstLayout>
CUTE_HOST_DEVICE
void
copy(Tensor<SrcEngine, SrcLayout> const& src,
     Tensor<DstEngine, DstLayout>      & dst)
{
  constexpr int N = decltype(max_common_vector(src, dst))::value;

#if 0
  if (thread0()) {
    print("copy -- found a max_common_vector of %d\n", N);
    print("   "); print(src.data()); print(" o "); print(layout(src)); print("\n");
    print("   "); print(dst.data()); print(" o "); print(layout(dst)); print("\n");
  }
#endif

  if constexpr (N <= 1) {
    return copy_if(TrivialPredTensor{}, src, dst);
  } else {
    constexpr int vec_bits = N * sizeof_bits<typename SrcEngine::value_type>::value;
    using VecType = uint_bit_t<cute::min(128, vec_bits)>;
    return copy_vec<VecType>(src, dst);
  }
}

So I expected gdb to step into the body of copy, not into “return operator()(make_coord(c0,c1,cs…));”.

Why does this happen? Am I using cuda-gdb incorrectly? Thank you!!!

Hello! Thanks for reaching out.

What happens if you continue stepping through the code?

To me this looks like expected behavior. You are calling the copy method with the arguments tAgA(_,_,k) and tAsA. The tAgA object is a Tensor, a class which defines operator(). Before copy can be called, its arguments must first be evaluated, which is what you are seeing happen with step.

I believe if you continue stepping you should eventually wind up in the copy method code.
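One common way to avoid stepping through every argument (a sketch; exact behavior can vary with heavily templated device code) is to let the argument call run to completion with finish and then step again:

```
(cuda-gdb) step      # lands in Tensor::operator() (argument evaluation)
(cuda-gdb) finish    # run operator() to completion, back at the call site
(cuda-gdb) step      # should now step into cute::copy itself
```

If several arguments involve calls, you may need to repeat the finish/step pair once per argument.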


Oh… OK…?
Well, you know gdb can be slow… Do you mean I need to step through each argument? That will take a long time… Can I skip this step and go directly into the function itself?

I have many function calls. Sometimes step behaves like this and enters a specific argument, but mostly it dives straight into the function. When will it enter an argument first?

(cuda-gdb) break sgemm_nt_1.cu:144
Breakpoint 1 at 0xdb5e: file sgemm_nt_1.cu, line 240.
(cuda-gdb) run
Starting program: /home/zyhuang/temp_can/sgemm_nt_1 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 126826]
[New Thread 0x7fffdffff000 (LWP 126830)]
[New Thread 0x7fffdf7fe000 (LWP 126831)]
Using device 0: NVIDIA A100 80GB PCIe  (SM80, 108 SMs)
M = 5120
N = 5120
K = 4096
Verification by comparison with cuBLAS is disabled, either because the CMake option CUTLASS_ENABLE_CUBLAS was explicitly set to OFF, or because CMake could not find cuBLAS.  If you would like to enable verification with cuBLAS, please set the CMake option CUTLASS_ENABLE_CUBLAS to ON, rerun CMake, and recompile this example.
[Switching focus to CUDA kernel 0, grid 1, block (0,3,0), thread (160,0,0), device 0, sm 13, warp 5, lane 0]

Thread 1 "sgemm_nt_1" hit Breakpoint 1, gemm_device<int, int, int, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<32>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<32> > >, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<32>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<32> > >, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<16>, cute::C<16> >, cute::tuple<cute::C<1>, cute::C<16> > >, float, float>
   <<<(40,40,1),(256,1,1)>>> (M=5120, N=5120, K=4096, A=0x7fffb6000000, dA=..., blockA=..., tA=..., 
    B=0x7fffbb000000, dB=..., blockB=..., tB=..., C=0x7fffc0000000, dC=..., tC=..., alpha=1, beta=0)
    at sgemm_nt_1.cu:144
144       if(thread0()) {
(cuda-gdb) step
0x00007fffdecbf0b0 in cute::size<2, cute::ViewEngine<cute::gmem_ptr<float const*> >, cute::Layout<cute::tuple<cute::C<4>, cute::C<1>, int>, cute::tuple<cute::C<32>, cute::C<0>, int> > ><<<(40,40,1),(256,1,1)>>> (
    tensor=...) at sgemm_nt_1.cu:155
155         print("\n\n");
(cuda-gdb) step
199       auto k_max = size<2>(tAgA);
(cuda-gdb) s
539       return size<Is...>(tensor.layout());
(cuda-gdb) s
cute::Tensor<cute::ViewEngine<cute::gmem_ptr<float const*> >, cute::Layout<cute::tuple<cute::C<4>, cute::C<1>, int>, cute::tuple<cute::C<32>, cute::C<0>, int> > >::layout (this=0x7ffff2fffb48)
    at /home/zyhuang/temp_can/../cutlass/include/cute/tensor.hpp:142
142         return get<0>(rep_);
(cuda-gdb) s
cute::get<0ul, cute::Layout<cute::tuple<cute::C<4>, cute::C<1>, int>, cute::tuple<cute::C<32>, cute::C<0>, int> >, cute::ViewEngine<cute::gmem_ptr<float const*> > > (t=...)
    at /home/zyhuang/temp_can/../cutlass/include/cute/container/tuple.hpp:204
204       return detail::getv<I>(t);
(cuda-gdb) s
cute::Tensor<cute::ViewEngine<cute::gmem_ptr<float const*> >, cute::Layout<cute::tuple<cute::C<4>, cute::C<1>, int>, cute::tuple<cute::C<32>, cute::C<0>, int> > >::layout (this=0x7ffff2fffb48)
    at /home/zyhuang/temp_can/../cutlass/include/cute/tensor.hpp:142
142         return get<0>(rep_);
(cuda-gdb) s
cute::size<2, cute::ViewEngine<cute::gmem_ptr<float const*> >, cute::Layout<cute::tuple<cute::C<4>, cute::C<1>, int>, cute::tuple<cute::C<32>, cute::C<0>, int> > ><<<(40,40,1),(256,1,1)>>> (tensor=...)
    at /home/zyhuang/temp_can/../cutlass/include/cute/tensor.hpp:539
539       return size<Is...>(tensor.layout());

Still for this sgemm code, I want to know what thread0 is (its detailed definition). Using VS Code and Ctrl+click, I can see it probably comes from cutlass > include > cute > util > debug.hpp, but how can I find it using cuda-gdb? Thank you!!!

#if 1
  if(thread0()) {
    print("mA\n");

The “if(thread0())” is at line 144, and as shown above, I put a breakpoint there.


When you set a breakpoint at line 144 and run until you hit it, you can use ptype thread0. It will show you the prototype for the function. We don’t currently support calling device functions in gdb expressions with cuda-gdb, so I expect the following to fail: print thread0().
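To go further and locate the definition itself, a few standard gdb commands should also work in cuda-gdb (a sketch; the exact qualified symbol name may differ in your build):

```
(cuda-gdb) ptype cute::thread0        # show the function's prototype
(cuda-gdb) info functions thread0     # list matching symbols and where they live
(cuda-gdb) list cute::thread0         # print the source around its definition
```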
