I am using the LLVM NVPTX backend to generate some device functions, but when I try to debug them, cuda-gdb sometimes behaves strangely.
In the attached example, I implement a simple device function that does almost nothing, but declares a local FILE pointer.
//lib1.c
#include <stdio.h>
int test()
{
    FILE* x;  /* unused local; declaring it is what triggers the problem */
    int y;
    y = 1 + 2;
    return y;
}
When I try to step into this function from a __global__ function, even with stepi, cuda-gdb throws errors:
Error: Failed to read_call_depth (dev=0, sm=0, wp=0, ln=0), error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).
Error: A SIGPIPE has been received, this is likely due to a crash from the CUDA backend.
Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).
Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).
...
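For reference, a session with the failing binary (main1.out from the tarball described below) looks roughly like this; the kernel name is a placeholder for whatever __global__ function calls test():

$ cuda-gdb ./main1.out
(cuda-gdb) break mykernel      # hypothetical name of the calling kernel
(cuda-gdb) run
(cuda-gdb) stepi               # single-stepping toward the call into test()
Error: Failed to read_call_depth (dev=0, sm=0, wp=0, ln=0), error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).
...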
However, if I replace the FILE* with a void*, the LLVM backend generates nearly identical PTX (ignoring the debug information), yet cuda-gdb can step into the function call.
//lib2.c
#include <stdio.h>
int test()
{
    void* x;
    int y;
    y = 1 + 2;
    return y;
}
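The PTX files were generated by compiling these sources with the LLVM NVPTX backend, along the lines of the commands below (a sketch; the exact flags and clang version in my setup may differ):

clang --target=nvptx64-nvidia-cuda -S -g -O0 lib1.c -o lib1.ptx
clang --target=nvptx64-nvidia-cuda -S -g -O0 lib2.c -o lib2.ptx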
I was testing on an RTX 3060, driver 525.60.13, CUDA Toolkit 11.8.0.
cuda-gdb 12 does not emit the repeating CUDBG_ERROR_COMMUNICATION_FAILURE errors, but it still cannot step into the function call.
I would like to use the LLVM NVPTX backend for some development work, so it would be nice to have the debugging tools working properly.
Here is the full project with generated files included, though compiling from the provided ptx and cu files should be enough to reproduce the error: CudaStep.bak2.tar.xz (223.9 KB)
The uploaded tarball contains two PTX files, “lib1.ptx” and “lib2.ptx”, built from the similar sources shown above. However, when debugging the resulting CUDA applications:
main1.out, produced by linking main.cu + lib1.ptx, makes cuda-gdb crash.
main2.out, produced by linking main.cu + lib2.ptx, lets cuda-gdb step into the function call.
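In spirit, main.cu is just a __global__ wrapper that calls test(), with the two binaries linked against the respective PTX files. A sketch follows; the real main.cu is in the tarball, and the kernel name and exact nvcc flags here are assumptions:

// main.cu (sketch, not the exact file from the tarball)
#include <cstdio>

extern "C" __device__ int test(void);  // implemented in lib1.ptx / lib2.ptx

__global__ void mykernel(int *out)     // hypothetical kernel name
{
    *out = test();                     // the call cuda-gdb fails to step into
}

int main()
{
    int *out, host = 0;
    cudaMalloc(&out, sizeof(int));
    mykernel<<<1, 1>>>(out);
    cudaMemcpy(&host, out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", host);              // expect 3
    cudaFree(out);
    return 0;
}

nvcc -g -G -rdc=true main.cu lib1.ptx -o main1.out
nvcc -g -G -rdc=true main.cu lib2.ptx -o main2.out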
When cuda-gdb prints those error codes, I find that an NvDebugAgent process created by the application being debugged has crashed. “ps” outputs something like this:
... [NvDebugAgent] <defunct>
If I attach a debugger to the NvDebugAgent process, it reports that the process crashed with a segmentation fault, stopped on an instruction trying to access inaccessible memory.
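(To see the fault itself, I have to attach before the agent dies, with something like

gdb -p $(pgrep NvDebugAgent)

since once it shows up as <defunct> in ps there is nothing left to attach to.)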
NvDebugAgent appears to be executing code from “libcudadebugger.so”. I haven't found any relevant source code; is that library closed source so far?
Why do these small changes lead to such a large difference in cuda-gdb's behavior? Is the crash caused by the debug information, or by something else? Do you have any hints on how to avoid the crash?
Hi @sjlcwn
A fix for this issue will go out in an upcoming CUDA Toolkit release. In the meantime, please continue to use the CUDBG_USE_LEGACY_DEBUGGER=1 environment variable to debug your code sample.
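For example:

CUDBG_USE_LEGACY_DEBUGGER=1 cuda-gdb ./main1.out

This switches cuda-gdb back to the legacy debugger backend instead of the newer one implemented in libcudadebugger.so.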