Single stepping until exit from function, which has no line number information

I try to debug two kernels in the same source file. One kernel I can successfully enter and print variables’ values. Another kernel reports: Single stepping until exit from function _ZN3gpu7kernels27modify_impl_inverses_kernelILi8EdjjEEvPPT0_PKS2_PKT2_S9_T1_S7_S7_, which has no line number information.

I can reproduce this problem on two servers with Ubuntu 22.04. My colleague has similar setup with me on similar machine with Ubuntu 22.04 but the same build debugs with no problems for them.
On the server with RH8 everything works.

andriy@server:/sandbox$ cuda-gdb --args   ../out/build/sim-sr2-debug/tests/test_drs_scaling  --gtest_filter=DRSScaling.TestDRSxMXSPR024
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.3 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../out/build/sim-sr2-debug/tests/test_drs_scaling...
(cuda-gdb) b aim_device_kernels.cuh:29
Breakpoint 1 at 0x107730: file /cmg/andriy/repos/gpusol.lib/src/../include/aim_device_kernels.cuh, line 36.
(cuda-gdb) b aim_device_kernels.cuh:196
Breakpoint 2 at 0x10e6e2: aim_device_kernels.cuh:196. (6 locations)
(cuda-gdb) run
Starting program: /cmg/andriy/repos/gpusol.lib/out/build/sim-sr2-debug/tests/test_drs_scaling --gtest_filter=DRSScaling.TestDRSxMXSPR024
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Running main() from /cmg/andriy/repos/gpusol.lib/external/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = DRSScaling.TestDRSxMXSPR024
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DRSScaling
[ RUN      ] DRSScaling.TestDRSxMXSPR024
[New Thread 0x7fffb1f7b000 (LWP 315462)]
[New Thread 0x7fffb0c96000 (LWP 315463)]
[Detaching after fork from child process 315464]
AMGX version 2.4.0
Built on Feb 27 2024, 16:40:57
Compiled with CUDA Runtime 12.3, using CUDA driver 12.3

Reading system from file: /cmg/andriy/repos/gpusol.lib/sandbox/../matrices/IMEX/mxspr024[it=4 ncyc=1].hdf
[New Thread 0x7fffaafde000 (LWP 315476)]
Device ID: 0 -- Compute Capability : 9.0
                Number of SMs: 132; Warp Size: 32
Device ID: 0 -- Compute Capability : 9.0
                Number of SMs: 132; Warp Size: 32
[Switching focus to CUDA kernel 0, grid 9, block (0,0,0), thread (0,0,0), device 0, sm 128, warp 0, lane 0]

Thread 1 "test_drs_scalin" hit Breakpoint 1, gpu::kernels::fill_eqtrow<unsigned int, unsigned int><<<(3696,1,1),(32,1,1)>>> (rhs_offsets=0x7ffd83e02a00, num_rows=123, eqtrow=0x7ffd83e04c00) at /cmg/andriy/repos/gpusol.lib/src/../include/aim_device_kernels.cuh:29
29	        auto stride = blockDim.x * gridDim.x;
(cuda-gdb) n
31	        for (auto isol = index; isol < num_rows; isol += stride) {
(cuda-gdb) continue
Continuing.
[Switching focus to CUDA kernel 1, grid 19, block (0,0,0), thread (0,0,0), device 0, sm 128, warp 0, lane 0]

Thread 1 "test_drs_scalin" hit Breakpoint 1, gpu::kernels::fill_eqtrow<unsigned int, unsigned int><<<(3696,1,1),(32,1,1)>>> (rhs_offsets=0x7ffd83e16e00, num_rows=123, eqtrow=0x7ffd83e19000) at /cmg/andriy/repos/gpusol.lib/src/../include/aim_device_kernels.cuh:29
29	        auto stride = blockDim.x * gridDim.x;
(cuda-gdb) continue
Continuing.

Total number of unknowns = 607

SPMV Diag Kernel: MAX_BLOCKS = 4 NUM_BLOCKS = 38 THREADS_PER_BLOCK = 128
SPMV Diag Kernel: EQUATIONS_PER_BLOCK = 16 THREADS_PER_EQUATION = 8
SPMV Diag Kernel: Launched blocks of size 128. Theoretical occupancy: 0.250000
SPMV Kernel: ROWS_PER_BLOCK = 25 THREADS_PER_ROW = 5
SPMV Kernel: Max 9 blocks of size 128. Theoretical occupancy: 0.562500.
SPMV Kernel: Actual 5 blocks of size 128. Actual occupancy: 0.312500


PRE ABF: |b| =   2.3027702480588856E+05
PRE ABF: |b-A*x_sim| =   2.7582258389031086E+01
PRE ABF: |b-A*x_sim| / |b| =   1.1977859455271963E-04


DRS SCALING:=======Ranges of diagonal sizes=======
DRS SCALING: Range[1] of size 5 starts at block 0
DRS SCALING: Range[2] of size 1 starts at block 121
DRS SCALING: Total blocks = 123
[Switching focus to CUDA kernel 2, grid 45, block (0,0,0), thread (0,0,0), device 0, sm 128, warp 0, lane 0]

Thread 1 "test_drs_scalin" hit Breakpoint 2, 0x00007ffd3f6beaf0 in void gpu::kernels::modify_impl_inverses_kernel<8, double, unsigned int, unsigned int>(double**, double const*, unsigned int const*, unsigned int const*, unsigned int, unsigned int, unsigned int)<<<(4,1,1),(256,1,1)>>> ()
(cuda-gdb) n
Single stepping until exit from function _ZN3gpu7kernels27modify_impl_inverses_kernelILi8EdjjEEvPPT0_PKS2_PKT2_S9_T1_S7_S7_,
which has no line number information.
[Switching focus to CUDA kernel 3, grid 45, block (0,0,0), thread (32,0,0), device 0, sm 128, warp 1, lane 0]

Thread 1 "test_drs_scalin" hit Breakpoint 2, 0x00007ffd3f6beaf0 in void gpu::kernels::modify_impl_inverses_kernel<8, double, unsigned int, unsigned int>(double**, double const*, unsigned int const*, unsigned int const*, unsigned int, unsigned int, unsigned int)<<<(4,1,1),(256,1,1)>>> ()
(cuda-gdb) 

What can be the issue?
Is there something wrong with my environment?

Appreciate any suggestions.

Hi @andriy.roshchenko,
Thank you for reporting the issue! Could you please share additional information to help us identify the problem?

  • nvidia-smi command output (ideally on both machines - where debugging is working and where it’s not)
  • Additional logs from the debugger (ideally for both machines as well):
    • Add NVLOG_CONFIG_FILE variable pointing the nvlog.config file (attached). E.g.: NVLOG_CONFIG_FILE=${HOME}/nvlog.config
      nvlog.config (539 Bytes)

    • Run the debugging session.

    • You should see the /tmp/debugger.log file created - could you share it with us?

Hi @AKravets,

I am able to reproduce both states - when I am able to debug all kernels and when I can debug only some of them - on the same machine and the same code base.

In order to be able to debug all kernels I must enable these two snippets of code in the main .cu file:

#if defined(_DEBUG) && 1
    amgx::thrust::device_vector<aim_matrix::rhs_offset_type> rhs_offsets((aimm.RHSOffsets()), 
                       (aimm.RHSOffsets()+10));
    printf("%s (%d):\n", __FUNCTION__, __LINE__);
    for (size_t i = 0; i < 5; i+=2)
    {
        int a = rhs_offsets[i], b = rhs_offsets[i+1];
        printf("%d %d\n", a, b);
    }
#endif
#if defined(_DEBUG) && 1
    rhs_offsets.assign((aimm.RHSOffsets()), 
                       (aimm.RHSOffsets()+10));
    printf("%s (%d):\n", __FUNCTION__, __LINE__);
    for (size_t i = 0; i < 5; i+=2)
    {
        int a = rhs_offsets[i], b = rhs_offsets[i+1];
        printf("%d %d\n", a, b);
    }
#endif

Note. This code has nothing to do with the kernels I try to debug. It is located in a completely different unit of translation and is CPU code mostly.

@andriy.roshchenko
Thank you very much for the logs. We are investigating the issue.

i meet the same problem. Move my kernel to a single compile unit helps, but still cannot debug when the kernel is long.

One observation is when the kernel goes too long (although most lines are comments) the debugger just doesn’t work.