cuda-gdb segmentation fault when attaching to a PyTorch training process

Hi folks.

I am using cuda-gdb to attach to a PyTorch training job, but as soon as I attach, the process raises a segmentation fault. How can I handle this? Thanks.

Environment:

name                           value
gpu                            NVIDIA A100-SXM4-80GB
host driver                    470.82.01
container cuda version         12.1
container cuda compat version  cuda-compat-12-1-530.30.02-1
torch                          2.1.0+cu12.1

cuda-gdb version

NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.1 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1

Backtrace from the core dump of the attached process

#0  0x00007f361679b86a in ?? () from /usr/lib64/libcuda.so.1
#1  0x00007f361682829a in ?? () from /usr/lib64/libcuda.so.1
#2  0x00007f361682a27a in ?? () from /usr/lib64/libcuda.so.1
#3  0x00007f3616730e2d in ?? () from /usr/lib64/libcuda.so.1
#4  0x00007f36168db135 in ?? () from /usr/lib64/libcuda.so.1
#5  0x00007f362148bc79 in __cudart1043 () from /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so
#6  0x00007f36214c6275 in cudaDeviceSynchronize () from /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so
#7  0x00007f362147b1e4 in c10::cuda::device_synchronize() () from /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so

Disassembly around frame 0

   0x7f361679b84a:      and    $0x24,%al
   0x7f361679b84c:      add    $0x8b492774,%eax
   0x7f361679b851:      rex.R and $0x48,%al
   0x7f361679b854:      mov    0x43a0(%rax),%ecx
   0x7f361679b85a:      test   %ecx,%ecx
   0x7f361679b85c:      je     0x7f361679bad0
   0x7f361679b862:      pause  
   0x7f361679b864:      mov    %rbx,%rsi
   0x7f361679b867:      mov    %rbp,%rdi
=> 0x7f361679b86a:      callq  0x7f3616af3090
   0x7f361679b86f:      mov    %eax,%r14d
   0x7f361679b872:      test   %eax,%eax
   0x7f361679b874:      je     0x7f361679b848
   0x7f361679b876:      mov    0x1c(%rsp),%eax
   0x7f361679b87a:      movl   $0x8,0x28(%rsp)
   0x7f361679b882:      mov    %eax,0x2c(%rsp)

(gdb) x/10i 0x7f3616af3090                                                                                                                                                                          
   0x7f3616af3090:      push   %r15                                                                                                                                                                 
   0x7f3616af3092:      mov    %rsi,%r15                                                                                                                                                            
   0x7f3616af3095:      push   %r14                                                                                                                                                                 
   0x7f3616af3097:      push   %r13                                                                                                                                                                 
   0x7f3616af3099:      push   %r12                                                                                                                                                                 
   0x7f3616af309b:      push   %rbp                                                                                                                                                                 
   0x7f3616af309c:      mov    %rdi,%rbp                                                                                                                                                            
   0x7f3616af309f:      push   %rbx                                                                                                                                                                 
   0x7f3616af30a0:      sub    $0x28,%rsp                                                                                                                                                           
   0x7f3616af30a4:      mov    (%rsi),%rax                                                                                                                                                          
(gdb) i r                           
rax            0x1bed76c2          468547266
rbx            0x7ffcc0a613a0      140723540595616    
rcx            0x7ffcc0b13b12      140723541326610    
rdx            0x0                 0
rsi            0x7ffcc0a613a0      140723540595616
rdi            0x7ffcc0a61314      140723540595476
rbp            0x7ffcc0a61314      0x7ffcc0a61314
rsp            0x7ffcc0a612f0      0x7ffcc0a612f0
r8             0x7b46420           129262624
r9             0x100000000         4294967296
r10            0xffffffff00000000  -4294967296              
r11            0x293               659
r12            0x77f6190           125788560
r13            0x1                 1              
r14            0x0                 0              
r15            0x7ffcc0a61320      140723540595488
rip            0x7f361679b86a      0x7f361679b86a 
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51            
ss             0x2b                43            
ds             0x0                 0        
es             0x0                 0         
fs             0x0                 0          
gs             0x0                 0
k0             0x3f8000003f800000  4575657222473777152
k1             0x3f8000003f800000  4575657222473777152
k2             0x0                 0
k3             0x0                 0
k4             0x0                 0
k5             0x0                 0
k6             0x0                 0
k7             0x0                 0

cuda-gdb also segfaults with CUDA 12.5 in an NGC container. However, when I build cuda-gdb from source inside the container, it works.

Thank you for reporting this issue.
Can you post the Dockerfile used to set up your container? I want to make sure I’m looking at the right configuration.
You mention above that you’re using CUDA Toolkit version 12.1, which is pretty old. From the backtrace the segfault is in libcuda.so. Can you provide the output of the following command (from within your container)?

ls -l /usr/lib64/libcuda.*

Thanks for your reply.

This is the script I use to install CUDA.

echo "/usr/local/cuda/lib" >>/etc/ld.so.conf.d/nvidia.conf &&
    echo "/usr/local/cuda/lib64" >>/etc/ld.so.conf.d/nvidia.conf &&
    echo "/usr/lib64" >>/etc/ld.so.conf.d/nvidia.conf &&
    echo "ldconfig > /dev/null 2>&1 " >>/etc/bashrc

cuda_base=(
    cuda-cudart
    cuda-nvrtc
    cuda-nvcc
    cuda-nvprune
    cuda-driver-devel
    cuda-cuobjdump
    cuda-nvtx
    cuda-nvdisasm
    cuda-compat
    cuda-cccl
    cuda-nvrtc-devel
    cuda-cudart-devel
    cuda-cupti
    cuda-gdb
    cuda-nvml-devel
    cuda-nvprof
    cuda-profiler-api
    cuda-sanitizer
    libcublas
    libcurand
    libcusparse
    libcusparse-devel
    libcurand-devel
    libcufile-devel
    libcufile
    libcufft
    libcusolver
    libcusolver-devel
    libcufft-devel
    libcublas-devel
    libnvjitlink
)

cuda_base=("${cuda_base[@]/%/-12-1}")
cuda_base+=(libcudnn8-8.9.3.28 libcudnn8-devel-8.9.3.28)
cuda_base+=(libnccl-devel-2.18.3-1+cuda12.1 libnccl-2.18.3-1+cuda12.1)
yum install -y "${cuda_base[@]}"

CUDA is using cuda-compat-12-1-530.30.02-1; maybe torch was not compiled against this version?

#ls -l /usr/lib64/libcuda.*
lrwxrwxrwx 1 root root       12 Aug 21 09:19 /usr/lib64/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Aug 21 09:19 /usr/lib64/libcuda.so.1 -> libcuda.so.530.30.02
-rwxr-xr-x 1 root root 24139176 Sep  6  2022 /usr/lib64/libcuda.so.470.82.01
-rwxr-xr-x 1 root root 29900840 Feb 22  2023 /usr/lib64/libcuda.so.530.30.02
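
The listing shows that /usr/lib64/libcuda.so.1 resolves to the 530.30.02 compat library, while the host driver is 470.82.01. As a rough sketch, the same check can be done programmatically. The function names below are hypothetical, and the default path is simply the one from the backtrace above:

```python
import os
import re


def driver_version_from_name(filename):
    """Extract the driver version tuple from a name like 'libcuda.so.530.30.02'.

    Returns None when the name carries no multi-part version (e.g. 'libcuda.so.1').
    """
    match = re.search(r"libcuda\.so\.(\d+(?:\.\d+)+)$", filename)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def resolved_libcuda_version(path="/usr/lib64/libcuda.so.1"):
    """Follow symlinks and return the version of the file that `path` points to."""
    return driver_version_from_name(os.path.basename(os.path.realpath(path)))


if __name__ == "__main__":
    host_driver = (470, 82, 1)  # host driver version from the environment table
    print("libcuda.so.1 resolves to:", resolved_libcuda_version())
    print("host driver:", host_driver)
```

If the two versions differ, the process is running on the compat library rather than on the library that matches the host driver.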

The segfault appears to be related to Python (3.8): if I use cuda-gdb-minimal, it works fine. So I rebuilt cuda-gdb against my local Python, and it works.

I built cuda-gdb 12.1 from source, and a different error occurs with Python 3.8.19:

python/py-unwind.c:658: internal-error: pyuw_sniffer: Assertion `PyTuple_Check (pyo_execute_ret.get ())' failed.                                                                                                                                                                                                                
A problem internal to GDB has been detected,                                                                                                                                                                                                                                                                                    
further debugging may prove unreliable.                                                                                                                                                                                                                                                                                         
----- Backtrace -----                                                                                                                                                                                                                                                                                                           
0xc642be ???                                                                                                                                                                                                                                                                                                                    
0xc64361 ???                                                                                                                                                                                                                                                                                                                    
0x123b7d9 ???                                                                                                                                                                                                                                                                                                                   
0x123baf9 ???                                                                                                                                                                                                                                                                                                                   
0x1409daa ???                                                                                                                                                                                                                                                                                                                   
0x10aa995 ???                                                                                                                                                                                                                                                                                                                   
0xedeae0 ???                                                                                                                                                                                                                                                                                                                    
0xeded73 ???                                                                                                                                                                                                                                                                                                                    
0xee52ec ???                                                                                                                                                                                                                                                                                                                    
0x11653ce ???                                                                                                                                                                                                                                                                                                                   
0x1163a28 ???                                                                                                                                                                                                                                                                                                                   
0xf56c22 ???                                                                                                                                                                                                                                                                                                                    
0xf56ca7 ???                                                                                                                                                                                                                                                                                                                    
0x1215887 ???                                                                                                                                                                                                                                                                                                                   
0xb925ee ???                                                                                                                                                                                                                                                                                                                    
0xf5ca8d ???                                                                                                                                                                                                                                                                                                                    
0xf5bc8b ???                                                                                                                                                                                                                                                                                                                    
0xf575d6 ???                                                                                                                                                                                                                                                                                                                    
0xf39b18 ???                                                                                                                                                                                                                                                                                                                    
0xf39b57 ???                                                                                                                                                                                                                                                                                                                    
0xf3b616 ???                                                                                                                                                                                                                                                                                                                    
0xcdd007 ???                                                                                                                                                                                                                                                                                                                    
0xf3d1ed ???                                                                                                                                                                                                                                                                                                                    
0xf2d411 ???                                                                                                                                                                                                                                                                                                                    
0xf4c61e ???                                                                                                                                                                                                                                                                                                                    
0xf2d3ad ???                                                                                                                                                                                                                                                                                                                    
0xf94b74 ???                                                                                                                                                                                                                                                                                                                    
0x140aac9 ???                                                                                                                                                                                                                                                                                                                   
0x140b051 ???                                                                                                                                                                                                                                                                                                                   
0x1409ee5 ???                                                                                                                                                                                                                                                                                                                   
0x11ec423 ???                                                                                                                                                                                                                                                                                                                   
0x11ec4b8 ???                                                                                                                                                                                                                                                                                                                   
0xfc47fb ???                                                                                                                                                                                                                                                                                                                    
0xfc5ad7 ???                                                                                                                                                                                                                                                                                                                    
0xfc601b ???                                                                                                                                                                                                                                                                                                                    
0xfc6086 ???                                                                                                                                                                                                                                                                                                                    
0xb329cc ???                                                                                                                                                                                                                                                                                                                    
0x7f62ac731192 ???                                                                                                                                                                                                                                                                                                              
0xb328cd ???                                                                                                                                                                                                                                                                                                                    
0xffffffffffffffff ???                                                                                                                                                                                                                                                                                                          
---------------------                                                                                                                                                                                                                                                                                                           
python/py-unwind.c:658: internal-error: pyuw_sniffer: Assertion `PyTuple_Check (pyo_execute_ret.get ())' failed.                                                                                                                                                                                                                
A problem internal to GDB has been detected,                                                                                                                                                                                                                                                                                    
further debugging may prove unreliable.                                                                                                                                                                                                                                                                                         
Quit this debugging session? (y or n) y  

I found the following in gdb/python/py-unwind.c:

  gdbpy_ref<> pyo_execute_ret
#ifndef NVIDIA_CUDA_GDB
    (PyObject_CallFunctionObjArgs (pyo_execute.get (),
#else
    (gdbpy_PyObject_CallFunctionObjArgs (pyo_execute.get (),
#endif
                   pyo_pending_frame.get (), NULL));
  if (pyo_execute_ret == nullptr)
    {
      /* If the unwinder is cancelled due to a Ctrl-C, then propagate
     the Ctrl-C as a GDB exception instead of swallowing it.  */
      gdbpy_print_stack_or_quit ();
      return 0;
    }
#ifndef NVIDIA_CUDA_GDB
  if (pyo_execute_ret == Py_None)
#else
  if (pyo_execute_ret == gdbpy_None)
#endif
    return 0;

  /* Verify the return value of _execute_unwinders is a tuple of size 2.  */
  gdb_assert (PyTuple_Check (pyo_execute_ret.get ()));  // <----- error here
  gdb_assert (PyTuple_GET_SIZE (pyo_execute_ret.get ()) == 2);

and in gdb/python/lib/gdb/__init__.py, this function returns None:


def _execute_unwinders(pending_frame):
    """Internal function called from GDB to execute all unwinders.

    Runs each currently enabled unwinder until it finds the one that
    can unwind given frame.

    Arguments:
        pending_frame: gdb.PendingFrame instance.

    Returns:
        Tuple with:

          [0] gdb.UnwindInfo instance
          [1] Name of unwinder that claimed the frame (type `str`)

        or None, if no unwinder has claimed the frame.
    """
    for objfile in objfiles():
        for unwinder in objfile.frame_unwinders:
            if unwinder.enabled:
                unwind_info = unwinder(pending_frame)
                if unwind_info is not None:
                    return (unwind_info, unwinder.name)

    for unwinder in current_progspace().frame_unwinders:
        if unwinder.enabled:
            unwind_info = unwinder(pending_frame)
            if unwind_info is not None:
                return (unwind_info, unwinder.name)

    for unwinder in frame_unwinders:
        if unwinder.enabled:
            unwind_info = unwinder(pending_frame)
            if unwind_info is not None:
                return (unwind_info, unwinder.name)

    return None
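
To make the contract between the two sides concrete, here is a standalone sketch in plain Python (it does not use the gdb module, and `check_unwinder_result` and `stale_none` are hypothetical names). It mimics the checks in pyuw_sniffer: an identity comparison against a None sentinel, followed by the tuple assertions. The last case illustrates one possible explanation, which is an assumption and not a confirmed diagnosis: if the object the debugger compares against is not the None of the interpreter that actually ran the function, a None result falls through to the tuple assertion.

```python
def check_unwinder_result(result, none_sentinel=None):
    """Mimic the checks pyuw_sniffer applies to the value from _execute_unwinders.

    Returns False when no unwinder claimed the frame, True when a well-formed
    (unwind_info, name) tuple came back, and raises AssertionError otherwise.
    """
    if result is none_sentinel:  # identity check, like `== Py_None` in the C code
        return False
    assert isinstance(result, tuple), "PyTuple_Check failed"
    assert len(result) == 2, "expected a tuple of size 2"
    return True


# Normal case: the sentinel is the interpreter's real None.
print(check_unwinder_result(None))                       # False
print(check_unwinder_result((object(), "my-unwinder")))  # True

# Hypothetical failure mode: the sentinel is some other object, so None
# is no longer recognised and the tuple assertion fires instead.
stale_none = object()
try:
    check_unwinder_result(None, none_sentinel=stale_none)
except AssertionError as exc:
    print("assertion:", exc)  # assertion: PyTuple_Check failed
```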

But it seems the comparison in gdb/python/py-unwind.c does not recognise it:

#ifndef NVIDIA_CUDA_GDB
  if (pyo_execute_ret == Py_None)
#else
  if (pyo_execute_ret == gdbpy_None)
#endif
    return 0;

Since _execute_unwinders returns None, as a workaround I just return 0 unconditionally in gdb/python/py-unwind.c (which means no Python unwinder ever claims a frame):

--- py-unwind.c.bak     2024-09-19 17:40:57.567720586 +0800
+++ gdb/python/py-unwind.c      2024-09-19 18:07:03.695848852 +0800
@@ -647,6 +647,7 @@
       gdbpy_print_stack_or_quit ();
       return 0;
     }
+  return 0;
 #ifndef NVIDIA_CUDA_GDB
   if (pyo_execute_ret == Py_None)
 #else

But when I run bt in cuda-gdb, it raises an error:

Python Exception <class 'TypeError'>: 'NoneType' object is not iterable
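
For reference, this is the message Python produces whenever code tries to iterate over None. A minimal reproduction outside gdb is sketched below (`iterate` is a hypothetical helper); where inside cuda-gdb the None comes from is not established here.

```python
def iterate(value):
    """Return the items of `value` as a list, or the error message if iteration fails."""
    try:
        return [item for item in value]
    except TypeError as exc:
        return str(exc)


print(iterate((1, 2)))  # [1, 2]
print(iterate(None))    # 'NoneType' object is not iterable
```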

So maybe the unwinder in cuda-gdb is broken? Here is some basic info.

#./cuda-gdb --configuration
This GDB was configured as follows:
   configure --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu
             --with-auto-load-dir=$debugdir:$datadir/auto-load
             --with-auto-load-safe-path=$debugdir:$datadir/auto-load
             --with-expat
             --with-gdb-datadir=/data/cuda-gdb/share/gdb (relocatable)
             --with-jit-reader-dir=/data/cuda-gdb/lib/gdb (relocatable)
             --without-libunwind-ia64
             --with-lzma
             --without-babeltrace
             --without-intel-pt
             --with-mpfr
             --without-xxhash
             --with-python=/opt/conda
             --with-python-libdir=/opt/conda/lib
             --without-debuginfod
             --without-guile
             --disable-source-highlight
             --with-separate-debug-dir=/data/cuda-gdb/lib/debug (relocatable)


#./cuda-gdb -ex "python import sys; print(sys.version)" -ex "quit"                                                                                                                                                                                                                                                             
NVIDIA (R) CUDA Debugger
CUDA Toolkit 12.1 release
Portions Copyright (C) 2007-2023 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
3.8.19 (default, Mar 20 2024, 20:06:08)
[GCC 11.2.0]