I export CUDA_DEVICE_WAITS_ON_EXCEPTION=1, then execute my program, which crashes, so I am trying to see what is wrong by attaching with cuda-gdb -p.
Here is what I am getting:
The CUDA driver could not allocate operating system resources for attaching to the application.
An error occurred while in a function called from GDB.
Evaluation of the expression containing the function
(cudbgApiAttach) will be abandoned.
When the function is done executing, GDB will silently stop.
My versions:
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4
Kernel: 3.10.0-1062.18.1.1.el7.dug.x86_64 (standard Centos 7 kernel with minor, unrelated patches)
The shared library I am using links CUDA 11.2 statically.
Any ideas how to get to the bottom of it?
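For reference, the kind of minimal reproducer I would use to exercise the wait-on-exception/attach workflow looks something like this (hypothetical fault.cu, not my actual application; it just forces a device-side illegal address so there is something to attach to):

// fault.cu - hypothetical minimal reproducer, not the real application.
// Build:  nvcc -g -G fault.cu -o fault
// Run:    CUDA_DEVICE_WAITS_ON_EXCEPTION=1 ./fault    then attach: cuda-gdb -p <pid>
#include <cstdio>
#include <unistd.h>

__global__ void crash(int *p)
{
    *p = 42;   // deliberate write through a null device pointer
}

int main()
{
    printf("pid %d, launching faulting kernel\n", (int)getpid());
    crash<<<1, 1>>>(nullptr);
    // With CUDA_DEVICE_WAITS_ON_EXCEPTION=1 the device should wait after the
    // exception instead of tearing down the context, so cuda-gdb can attach.
    cudaError_t err = cudaDeviceSynchronize();
    printf("cudaDeviceSynchronize: %s\n", cudaGetErrorString(err));
    return 0;
}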
Hi @jacek.tomaka,
Thank you for your report! Could you please share additional details about the issue:
Can you start your application under the debugger and run the debugging session after it crashes? This would let us determine whether it is an attach issue or a generic debugging issue.
The output of the nvidia-smi command when the application crashes. It should show the amount of GPU memory available.
The output of the free command when the application crashes.
The dmesg output after the failed attach.
This error might also indicate that the in-process debugger is hitting the open-FD limit (i.e. it cannot open or create a file). If your application opens a lot of files, you could also try increasing the limit.
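For example, a tiny stand-alone check (hypothetical, not part of any NVIDIA tool) can print the per-process limit from the same environment the application runs in; ulimit -n or /proc/<pid>/limits give the same information:

// fdlimit.cu - hypothetical sketch: print this process's open-file limit.
#include <cstdio>
#include <sys/resource.h>

int main()
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    else
        perror("getrlimit");
    return 0;
}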
In this instance the application did not even crash; well, I am not certain what happens, because I can't attach the debugger:
dmesg output after attempt to attach:
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla PG500-216      On  | 00000000:3B:00.0 Off |                    0 |
| N/A   44C    P0    39W / 250W |  23097MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla PG500-216      On  | 00000000:61:00.0 Off |                    0 |
| N/A   45C    P0    38W / 250W |  22981MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla PG500-216      On  | 00000000:86:00.0 Off |                    0 |
| N/A   44C    P0    37W / 250W |  22979MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla PG500-216      On  | 00000000:DB:00.0 Off |                    0 |
| N/A   44C    P0    38W / 250W |  22979MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    143139      C   …202211041153/jre/bin/java      23081MiB |
|    1   N/A  N/A    143139      C   …202211041153/jre/bin/java      22965MiB |
|    2   N/A  N/A    143139      C   …202211041153/jre/bin/java      22963MiB |
|    3   N/A  N/A    143139      C   …202211041153/jre/bin/java      22963MiB |
+-----------------------------------------------------------------------------+
free -m
total used free shared buff/cache available
Mem: 191571 7187 61314 621 123069 182868
I will try to start an app from the debugger next.
BTW: could you grep for this error in the source code of the driver and CUDA to see what it tries to do?
This error means that some system resource could not be allocated:
Could not allocate memory (malloc/calloc)
Could not create/open file
Could not launch a new process (via execl)
We don't have hard constraints on system resources, since the required amount depends on the debugged application. In general, the debugger expects to be able to allocate CPU and GPU memory, create temporary files on disk, and launch processes.
I am less familiar with slurm - can running the job under slurm restrict its access to the resources listed above?
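As a rough way to narrow it down, a small stand-alone probe (hypothetical, not part of the debugger) run inside the same slurm allocation could check whether the operations listed above succeed there, for example:

// probe.cu - hypothetical probe for the resource types the debugger needs.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

int main()
{
    // 1. Host memory allocation.
    void *mem = malloc(64 << 20);
    printf("malloc(64MB): %s\n", mem ? "ok" : strerror(errno));
    free(mem);

    // 2. Temporary file creation (session files go under a temp directory).
    const char *tmp = getenv("TMPDIR") ? getenv("TMPDIR") : "/tmp";
    char path[4096];
    snprintf(path, sizeof(path), "%s/cudbg-probe", tmp);
    int fd = creat(path, 0600);
    printf("creat(%s): %s\n", path, fd >= 0 ? "ok" : strerror(errno));
    if (fd >= 0) { close(fd); unlink(path); }

    // 3. Launching a child process.
    pid_t pid = fork();
    if (pid < 0) {
        printf("fork: %s\n", strerror(errno));
    } else if (pid == 0) {
        execl("/bin/true", "true", (char *)NULL);
        _exit(127);
    } else {
        int status = 0;
        waitpid(pid, &status, 0);
        printf("fork+execl(/bin/true): %s\n",
               (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? "ok" : "failed");
    }
    return 0;
}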
How can I tell which one it is? I tried to strace cuda-gdb, but it just hangs when I do that, at ptrace(PTRACE_TRACEME) = -1 EPERM (Operation not permitted).
Sure. Is there any way to get information about which one fails?
Yes, by the looks of it ;) I am not sure how, though.
Is there a way to make cudbgApiAttach or the driver itself more talkative?
Could you give me the symbol name of the function that prints this: “The CUDA driver could not allocate operating system resources for attaching to the application.” ?
The failure happens in the debugged process (the CUDA application): part of the GPU debugging support is implemented in libcuda.so, which is loaded into the CUDA application. You can try tracing it, but using cuda-gdb and strace on the same process might lead to strange issues (we haven't really tested such scenarios).
Unfortunately we don't have such a mechanism (in release builds), so you would need to rely on system-wide reporting:
Well yes, as mentioned above, I have explored that option, but it does not lead anywhere, at least given my limited knowledge.
I had figured out that much, that cudbgApiAttach is defined in libcuda.so.1, but still, it is the interaction with the driver that is interesting here, and I had no luck there.
If I understand the interaction correctly, the real error is in the kernel, and the tools you propose concern userspace (except dmesg, which is clean in this case). What am I missing here?
Oh, right! This will get me going for a bit. Thanks!
Thanks.
Do you mean from cudbgReportedDriverInternalErrorCode, which returns CUDBG_ERROR_OS_RESOURCES = 0x0025 in the case of interest?
Any chance you could provide me with a libcuda.so.1 compatible with 11.4 that prints more information when this happens? Or at least a list of functions I could set breakpoints at to see which one fails.
Yes, this error is returned from libcuda.so (from inside the debugger process) when it cannot allocate system resources.
Unfortunately, that is not possible. Also, only the top-level entry function (cudbgApiAttach) is exposed from libcuda.so; the rest of the functions are internal and not visible from outside.
Since debugging works without slurm, you might need to work with the slurm configuration to figure out which resources are restricted (compared to running the application without slurm).
The CUDA debugger doesn't have any built-in capability for detailed reporting of which system call failed, so you would need to rely on system tools.
The issue is that cuda-gdb sets up its session directory in a subdirectory of its own $TMPDIR rather than the debuggee's $TMPDIR. When the two are different (as is the case when running under slurm), the creat call that creates the cudbgprocess file fails.
Note that this can be reproduced even on a local machine.
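Here is a minimal sketch of the failing pattern (a hypothetical stand-alone program, not the driver code; the two paths are the ones that show up in the backtraces below): the session directory is created under one TMPDIR, while the cudbgprocess file is created under a path built from a different TMPDIR, so creat fails with ENOENT:

// tmpdir_mismatch.cu - hypothetical illustration of the mismatch, not driver code.
#include <cstdio>
#include <cstring>
#include <cerrno>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    // What the cuda-gdb side does (cf. cuda_gdb_dir_create in the first backtrace
    // below): the session directory is created under the *debugger's* TMPDIR.
    mkdir("/dev/shm/cuda-dbg", 0700);
    mkdir("/dev/shm/cuda-dbg/190882", 0700);

    // What libcuda.so in the debugged process does (cf. the creat64 frame in the
    // second backtrace below): the file path is built from the *debuggee's*
    // TMPDIR, which slurm sets to a different directory.
    const char *path = "/tmp/tmpdir.47888517/cuda-dbg/190882/session1/cudbgprocess";
    int fd = creat(path, 0600);
    if (fd < 0)
        printf("creat(%s): %s\n", path, strerror(errno));  // ENOENT: parent dir missing
    else
        close(fd);
    return 0;
}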
gdb attached to cuda-gdb:
(gdb) bt
#0 0x00002b3ddd5346a0 in mkdir () from /lib64/libc.so.6
#1 0x0000000000532f6c in cuda_gdb_dir_create(char const*, unsigned int, bool, bool*) ()
#2 0x0000000000533682 in cuda_gdb_tmpdir_setup() ()
#3 0x0000000000533fcd in cuda_utils_initialize() ()
#4 0x0000000000531eb9 in cuda_initialize_target() ()
#5 0x000000000051f362 in cuda_nat_attach() ()
#6 0x00000000005e20e4 in attach_post_wait(char const*, int, attach_post_wait_mode) [clone .isra.41] ()
#7 0x00000000004df815 in do_all_inferior_continuations(int) ()
#8 0x00000000005db590 in inferior_event_handler(inferior_event_type, void*) ()
#9 0x00000000005f4716 in fetch_inferior_event(void*) ()
#10 0x0000000000597c4c in gdb_do_one_event() [clone .part.6] ()
#11 0x0000000000597d1e in start_event_loop() ()
#12 0x000000000062ed58 in captured_command_loop() ()
#13 0x000000000062fc6d in gdb_main(captured_main_args*) ()
#14 0x000000000040d705 in main ()
(gdb) print (char *) $rdi
$1 = 0x7ffe3dc88270 "/dev/shm/cuda-dbg/190882"
cuda-gdb attached to the crashing CUDA process:
When the function is done executing, GDB will silently stop.
(cuda-gdb) bt
#0 0x00002b832c297140 in creat64 () from /lib64/libc.so.6
#1 0x00002b832c9a27ec in ?? () from /lib64/libcuda.so.1
#2 <function called from gdb>
#3 0x00007ffc799cf6c2 in clock_gettime ()
#4 0x00002b832c2bb7ad in clock_gettime () from /lib64/libc.so.6
#5 0x00002b832c88b27f in ?? () from /lib64/libcuda.so.1
#6 0x00002b832c7f33d3 in ?? () from /lib64/libcuda.so.1
#7 0x00002b832c75413f in ?? () from /lib64/libcuda.so.1
#8 0x00002b832c755788 in ?? () from /lib64/libcuda.so.1
#9 0x00002b832c7b8faa in ?? () from /lib64/libcuda.so.1
#10 0x00000000004051a7 in __cudart1030 ()
#11 0x00000000004359a5 in cudaDeviceSynchronize ()
#12 0x0000000000403b10 in main ()
(cuda-gdb) print (char *) $rdi
$1 = 0x7ffc7998d800 "/tmp/tmpdir.47888517//cuda-dbg/190882/session1/cudbgprocess"
Note that this problem would have been identified in minutes rather than days if the error had been reported properly (indicating exactly which resource allocation failed).
Hi @jacek.tomaka,
Thank you very much for getting to the bottom of it! I have recorded the problem in our bug tracker - it will be addressed in one of the upcoming CUDA toolkit releases.
Is there anything else I can help you with, or can we mark the topic as resolved?