cuda-gdb CUDBG_ERROR_COMMUNICATION_FAILURE when stepping into a function generated by LLVM

I was using the LLVM NVPTX backend to generate some device functions, but when trying to debug them, cuda-gdb sometimes behaves strangely.
In the attached example, I implement a simple device function that does almost nothing, but has a local FILE pointer.

//lib1.c
#include <stdio.h>

int test()
{
        FILE* x;
        int y;
        y=1+2;
        return y;
}

When I try to step into this function from a global (kernel) function, even with stepi, cuda-gdb throws errors:

Error: Failed to read_call_depth (dev=0, sm=0, wp=0, ln=0), error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).
Error: A SIGPIPE has been received, this is likely due to a crash from the CUDA backend.
Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).
Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).
...

However, if I replace the FILE* with a void*, the LLVM backend generates similar PTX (if we ignore the debug information), yet cuda-gdb can step into the function call.

//lib2.c
#include <stdio.h>

int test()
{
        void* x;
        int y;
        y=1+2;
        return y;
}
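
For context, the way I build these is roughly the following (a sketch only; the exact flags are in the attached Makefile, and the -rdc setting and output names here are my approximation):

```shell
# Emit PTX (with debug info) from the C sources via clang's NVPTX backend
clang --target=nvptx64-nvidia-cuda -S -g lib1.c -o lib1.ptx
clang --target=nvptx64-nvidia-cuda -S -g lib2.c -o lib2.ptx

# Link each PTX file with the CUDA host code into a debuggable executable;
# -rdc=true is my assumption, so the kernel in main.cu can call the external test()
nvcc -g -G -rdc=true main.cu lib1.ptx -o main1.out
nvcc -g -G -rdc=true main.cu lib2.ptx -o main2.out
```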

I was testing on an RTX 3060, driver 525.60.13, CUDA Toolkit 11.8.0.
cuda-gdb 12 does not emit the repeating CUDBG_ERROR_COMMUNICATION_FAILURE errors, but it still cannot step into the function call successfully.

I would like to use the LLVM NVPTX backend for some development work, so it would be nice to have the debugging tools working properly.

Looking for help.

CudaStep.tar.xz (6.6 KB)

Hi @sjlcwn!

Thank you very much for reporting the issue; we are currently looking at it. Could you please share some additional info from your side:

  • Output of the cuobjdump -elf command for both final executables (main1 and main2). E.g. (if I understand your Makefile correctly):
cuobjdump -elf main1.out
cuobjdump -elf main2.out

Yes, the dumped results are here.

Elf.tar.xz (6.37 KB)

Hi @sjlcwn,
Thank you for the dumps. Could we also get the original elf files? (main1.out and main2.out)

Well, here’s the full project with the generated files included. I thought it would be enough to reproduce the error by compiling from the provided PTX and CU files, though.
CudaStep.bak2.tar.xz (223.9 KB)

After testing for a while, I found that a similar error can be triggered by some quite simple code.

CudaGdbCrash.tar.xz (282.6 KB)

In the uploaded tarball there are two PTX files, “lib1.ptx” and “lib2.ptx”, built from similar source code. However, when debugging the resulting CUDA applications:
main1.out, produced by linking main.cu + lib1.ptx, causes cuda-gdb to crash.
main2.out, produced by linking main.cu + lib2.ptx, lets cuda-gdb step into the function call.

When cuda-gdb prints the error code, I find that an NvDebugAgent process created by the application being debugged has crashed. The “ps” command outputs something like this:

... [NvDebugAgent] <defunct>

If I attach a debugger to the NvDebugAgent process, it reports that the process crashed with a segmentation fault, stopping on an instruction that tries to access inaccessible memory.
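
For reference, this is roughly how I caught the agent crash (the pid placeholder stands for whatever ps shows for NvDebugAgent in your session):

```shell
# While main1.out sits under cuda-gdb, locate the helper process
ps -ef | grep NvDebugAgent
# Attach a second, ordinary gdb to it before it turns into a <defunct> zombie
gdb -p <NvDebugAgent pid>
```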

The NvDebugAgent appears to execute code from “libcudadebugger.so”. I haven’t found any relevant source code, so is that file closed source so far?

Why do these small changes lead to such a large difference in cuda-gdb behavior? Is the crash caused by some debug information, or by something else? Do you have any hints on how to avoid the crash?

Setting the environment variable with export CUDBG_USE_LEGACY_DEBUGGER=1 seems to work around the crash.

Hi @sjlcwn
Thank you very much for the detailed info; we are looking at the crash right now.

Setting the environment variable with export CUDBG_USE_LEGACY_DEBUGGER=1 seems to work around the crash.

Thank you! This should help narrow down the issue.