I export CUDA_DEVICE_WAITS_ON_EXCEPTION=1, then execute my program, which crashes, so I am trying to see what is wrong by attaching with cuda-gdb -p.
Here is what I am getting:
The CUDA driver could not allocate operating system resources for attaching to the application.
An error occurred while in a function called from GDB.
Evaluation of the expression containing the function
(cudbgApiAttach) will be abandoned.
When the function is done executing, GDB will silently stop.
My versions:
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4
Kernel: 3.10.0-1062.18.1.1.el7.dug.x86_64 (standard Centos 7 kernel with minor, unrelated patches)
The shared library I am using links CUDA 11.2 statically.
Any ideas how to get to the bottom of it?
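For reference, the kind of minimal reproducer I would use to exercise the wait-on-exception/attach workflow looks something like this (hypothetical fault.cu, not my actual application; it just forces a device-side illegal address so there is something to attach to):

// fault.cu - hypothetical minimal reproducer, not the real application.
// Build:  nvcc -g -G fault.cu -o fault
// Run:    CUDA_DEVICE_WAITS_ON_EXCEPTION=1 ./fault    then attach: cuda-gdb -p <pid>
#include <cstdio>
#include <unistd.h>

__global__ void crash(int *p)
{
    *p = 42;   // deliberate write through a null device pointer
}

int main()
{
    printf("pid %d, launching faulting kernel\n", (int)getpid());
    crash<<<1, 1>>>(nullptr);
    // With CUDA_DEVICE_WAITS_ON_EXCEPTION=1 the device should wait after the
    // exception instead of tearing down the context, so cuda-gdb can attach.
    cudaError_t err = cudaDeviceSynchronize();
    printf("cudaDeviceSynchronize: %s\n", cudaGetErrorString(err));
    return 0;
}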
Hi @jacek.tomaka,
Thank you for your report! Could you please share additional details about the issue:
Can you start your application under the debugger and run the debugging session after it crashes? This would let us determine whether it is an attach issue or a generic debugging issue.
The output of the nvidia-smi command when the application crashes. It should show the amount of GPU memory available.
The output of the free command when the application crashes.
The dmesg output after the failed attach.
This error might also indicate that the in-process debugger is hitting the open-FD limit (i.e. it cannot open or create a file). If your application opens a lot of files, you could also try increasing the limit.
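For example, a tiny stand-alone check (hypothetical, not part of any NVIDIA tool) can print the per-process limit from the same environment the application runs in; ulimit -n or /proc/<pid>/limits give the same information:

// fdlimit.cu - hypothetical sketch: print this process's open-file limit.
#include <cstdio>
#include <sys/resource.h>

int main()
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    else
        perror("getrlimit");
    return 0;
}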
In this instance the application did not even crash; well, I am not certain what happens, because I can't attach the debugger:
dmesg output after attempt to attach:
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla PG500-216      On  | 00000000:3B:00.0 Off |                    0 |
| N/A   44C    P0    39W / 250W |  23097MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla PG500-216      On  | 00000000:61:00.0 Off |                    0 |
| N/A   45C    P0    38W / 250W |  22981MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla PG500-216      On  | 00000000:86:00.0 Off |                    0 |
| N/A   44C    P0    37W / 250W |  22979MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla PG500-216      On  | 00000000:DB:00.0 Off |                    0 |
| N/A   44C    P0    38W / 250W |  22979MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    143139      C   …202211041153/jre/bin/java      23081MiB |
|    1   N/A  N/A    143139      C   …202211041153/jre/bin/java      22965MiB |
|    2   N/A  N/A    143139      C   …202211041153/jre/bin/java      22963MiB |
|    3   N/A  N/A    143139      C   …202211041153/jre/bin/java      22963MiB |
+-----------------------------------------------------------------------------+
free -m
total used free shared buff/cache available
Mem: 191571 7187 61314 621 123069 182868
I will try to start an app from the debugger next.
BTW: could you grep for this error in the source code of the driver and CUDA to see what it tries to do?
This error means that some system resource could not be allocated:
Could not allocate memory (malloc/calloc)
Could not create/open file
Could not launch a new process (via execl)
We don't have hard constraints on system resources, since the required amount depends on the debugged application. In general, the debugger expects to be able to allocate CPU and GPU memory, create temporary files on disk, and launch processes.
I am less familiar with slurm - can running the job under slurm restrict its access to the resources listed above?
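As a rough way to narrow it down, a small stand-alone probe (hypothetical, not part of the debugger) run inside the same slurm allocation could check whether the operations listed above succeed there, for example:

// probe.cu - hypothetical probe for the resource types the debugger needs.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

int main()
{
    // 1. Host memory allocation.
    void *mem = malloc(64 << 20);
    printf("malloc(64MB): %s\n", mem ? "ok" : strerror(errno));
    free(mem);

    // 2. Temporary file creation (session files go under a temp directory).
    const char *tmp = getenv("TMPDIR") ? getenv("TMPDIR") : "/tmp";
    char path[4096];
    snprintf(path, sizeof(path), "%s/cudbg-probe", tmp);
    int fd = creat(path, 0600);
    printf("creat(%s): %s\n", path, fd >= 0 ? "ok" : strerror(errno));
    if (fd >= 0) { close(fd); unlink(path); }

    // 3. Launching a child process.
    pid_t pid = fork();
    if (pid < 0) {
        printf("fork: %s\n", strerror(errno));
    } else if (pid == 0) {
        execl("/bin/true", "true", (char *)NULL);
        _exit(127);
    } else {
        int status = 0;
        waitpid(pid, &status, 0);
        printf("fork+execl(/bin/true): %s\n",
               (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? "ok" : "failed");
    }
    return 0;
}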
How can I tell which one it is? I tried to strace cuda-gdb, but it just hangs when I do that, at ptrace(PTRACE_TRACEME) = -1 EPERM (Operation not permitted).
Sure. Is there any way to get information about which one fails?
Yes, by the looks of it ;) I am not sure how, though.
Is there a way to make cudbgApiAttach or the driver itself more talkative?
Could you give me the symbol name of the function that prints this: “The CUDA driver could not allocate operating system resources for attaching to the application.” ?
The failure happens in the debugged process (the CUDA application): part of the GPU debugging support is implemented in libcuda.so, which is loaded into the CUDA application. You can try tracing it, but using cuda-gdb and strace on the same process might lead to strange issues (we haven't really tested such scenarios).
Unfortunately we don't have such a mechanism (in release builds), so you would need to rely on system-wide reporting:
Well yes, as mentioned above, I have explored that option, but it does not lead anywhere, at least given my limited knowledge.
I had figured out that much, that cudbgApiAttach is defined in libcuda.so.1, but still, it is the interaction with the driver that is interesting here, and I had no luck there.
If I understand the interaction correctly, the real error is in the kernel, and the tools you propose concern userspace (except dmesg, which is clean in this case). What am I missing here?
Oh, right! This will get me going for a bit. Thanks!
Thanks.
Do you mean from cudbgReportedDriverInternalErrorCode, which returns CUDBG_ERROR_OS_RESOURCES = 0x0025 in the case of interest?
Any chance you could provide me with a libcuda.so.1 compatible with 11.4 that prints more information when this happens? Or at least a list of functions I could set breakpoints at to see which one fails.
Yes, this error is returned from libcuda.so (from inside the debugger process) when it cannot allocate system resources.
Unfortunately, that is not possible. Also, only the top-level entry function (cudbgApiAttach) is exposed from libcuda.so; the rest of the functions are internal and not visible from outside.
Since debugging works without slurm, you might need to work with the slurm configuration to figure out which resources are restricted (compared to running the application without slurm).
The CUDA debugger doesn't have any built-in capability for detailed reporting of which system call failed, so you would need to rely on system tools.
The issue is that cuda-gdb sets up its session directory in a subdirectory of its own $TMPDIR rather than the debuggee's $TMPDIR. When the two are different (as is the case when running under slurm), the creat call that creates the cudbgprocess file fails.
Note that this can be reproduced even on a local machine.
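Here is a minimal sketch of the failing pattern (a hypothetical stand-alone program, not the driver code; the two paths are the ones that show up in the backtraces below): the session directory is created under one TMPDIR, while the cudbgprocess file is created under a path built from a different TMPDIR, so creat fails with ENOENT:

// tmpdir_mismatch.cu - hypothetical illustration of the mismatch, not driver code.
#include <cstdio>
#include <cstring>
#include <cerrno>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    // What the cuda-gdb side does (cf. cuda_gdb_dir_create in the first backtrace
    // below): the session directory is created under the *debugger's* TMPDIR.
    mkdir("/dev/shm/cuda-dbg", 0700);
    mkdir("/dev/shm/cuda-dbg/190882", 0700);

    // What libcuda.so in the debugged process does (cf. the creat64 frame in the
    // second backtrace below): the file path is built from the *debuggee's*
    // TMPDIR, which slurm sets to a different directory.
    const char *path = "/tmp/tmpdir.47888517/cuda-dbg/190882/session1/cudbgprocess";
    int fd = creat(path, 0600);
    if (fd < 0)
        printf("creat(%s): %s\n", path, strerror(errno));  // ENOENT: parent dir missing
    else
        close(fd);
    return 0;
}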
gdb attached to cuda-gdb:
(gdb) bt
#0 0x00002b3ddd5346a0 in mkdir () from /lib64/libc.so.6
#1 0x0000000000532f6c in cuda_gdb_dir_create(char const*, unsigned int, bool, bool*) ()
#2 0x0000000000533682 in cuda_gdb_tmpdir_setup() ()
#3 0x0000000000533fcd in cuda_utils_initialize() ()
#4 0x0000000000531eb9 in cuda_initialize_target() ()
#5 0x000000000051f362 in cuda_nat_attach() ()
#6 0x00000000005e20e4 in attach_post_wait(char const*, int, attach_post_wait_mode) [clone .isra.41] ()
#7 0x00000000004df815 in do_all_inferior_continuations(int) ()
#8 0x00000000005db590 in inferior_event_handler(inferior_event_type, void*) ()
#9 0x00000000005f4716 in fetch_inferior_event(void*) ()
#10 0x0000000000597c4c in gdb_do_one_event() [clone .part.6] ()
#11 0x0000000000597d1e in start_event_loop() ()
#12 0x000000000062ed58 in captured_command_loop() ()
#13 0x000000000062fc6d in gdb_main(captured_main_args*) ()
#14 0x000000000040d705 in main ()
(gdb) print (char *) $rdi
$1 = 0x7ffe3dc88270 "/dev/shm/cuda-dbg/190882"
cuda-gdb attached to the crashing CUDA process:
When the function is done executing, GDB will silently stop.
(cuda-gdb) bt
#0 0x00002b832c297140 in creat64 () from /lib64/libc.so.6
#1 0x00002b832c9a27ec in ?? () from /lib64/libcuda.so.1
#2 <function called from gdb>
#3 0x00007ffc799cf6c2 in clock_gettime ()
#4 0x00002b832c2bb7ad in clock_gettime () from /lib64/libc.so.6
#5 0x00002b832c88b27f in ?? () from /lib64/libcuda.so.1
#6 0x00002b832c7f33d3 in ?? () from /lib64/libcuda.so.1
#7 0x00002b832c75413f in ?? () from /lib64/libcuda.so.1
#8 0x00002b832c755788 in ?? () from /lib64/libcuda.so.1
#9 0x00002b832c7b8faa in ?? () from /lib64/libcuda.so.1
#10 0x00000000004051a7 in __cudart1030 ()
#11 0x00000000004359a5 in cudaDeviceSynchronize ()
#12 0x0000000000403b10 in main ()
(cuda-gdb) print (char *) $rdi
$1 = 0x7ffc7998d800 "/tmp/tmpdir.47888517//cuda-dbg/190882/session1/cudbgprocess"
Note that this problem would have been identified in minutes rather than days if the error had been reported properly (indicating exactly which resource allocation failed).
Hi @jacek.tomaka,
Thank you very much for getting to the bottom of it! I have recorded the problem in our bug tracker - it will be addressed in one of the upcoming CUDA toolkit releases.
Is there anything else I can help you with, or can we mark the topic as resolved?