Can't use cuda-gdb

Hi,
I can’t use cuda-gdb on my machine, with or without sudo. Here is the error message I get from cuda-gdb:

Could not find CUDA Debugger back-end. Please try upgrading/re-installing the GPU driver
The CUDA driver has hit an internal error.
Error code: 0x1012400000001c
Further execution or debugging is unreliable.
Please ensure that your temporary directory is mounted with write and exec permissions.
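
The last line of the error is about how the temporary directory is mounted, which is not the same thing as the permission bits on the directory: a filesystem mounted with noexec can still show rwx. A minimal sketch of a direct check, assuming a POSIX shell. The function name tmp_is_writable_and_exec is made up for illustration, and the sketch assumes the debugger uses $TMPDIR when it is set and /tmp otherwise:

```shell
# Create a tiny script in the temporary directory and try to execute it.
# Succeeds only if the directory is both writable and allows execution.
tmp_is_writable_and_exec() {
    dir=${1:-${TMPDIR:-/tmp}}
    probe=$(mktemp "$dir/exec-probe.XXXXXX" 2>/dev/null) || return 1
    printf '#!/bin/sh\nexit 0\n' > "$probe"
    chmod u+x "$probe"
    "$probe" 2>/dev/null
    status=$?
    rm -f "$probe"
    return "$status"
}

if tmp_is_writable_and_exec; then
    echo "temporary directory is writable and executable"
else
    echo "temporary directory is not writable, or is mounted noexec"
fi
```

If this check passes, as it presumably would here, the temporary directory is unlikely to be the cause and the other lines of the error are the ones to follow up.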

My system:

  • arch linux 6.0.2-arch1-1
  • /tmp has rwx permission
nvidia-smi
Fri Nov 25 16:32:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce RTX 3050 Ti Mobile  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   62C    P8    11W /  N/A |      2MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |

Any idea?
Thanks

Hi @benjamin.rouxel1
Thank you for your report. Looks like your CUDA driver installation might be missing a debugger back-end library (introduced in CUDA 11.8). Could you please provide the following additional information:

  • How did you install the GPU driver? Could you provide a link to the package/installer?
  • Please share the output of the following command, run in the directory where libcuda.so.1 is located (e.g. /usr/lib/x86_64-linux-gnu, though it might be different on Arch):
ls -la libcuda*
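
If it is not obvious where libcuda.so.1 lives, the dynamic linker cache can usually tell. A rough sketch, assuming a glibc system where ldconfig -p prints lines of the form "name (arch) => /path"; the helper name lib_dir_from_cache is hypothetical:

```shell
# Read `ldconfig -p`-style output on stdin and print the directory
# of the first entry whose library name matches $1.
lib_dir_from_cache() {
    awk -v name="$1" '$1 == name {
        path = $NF                  # last field is the full path
        sub("/[^/]*$", "", path)    # strip the file name, keep the directory
        print path
        exit
    }'
}

dir=$(ldconfig -p 2>/dev/null | lib_dir_from_cache libcuda.so.1) || dir=""
if [ -n "$dir" ]; then
    ls -la "$dir"/libcuda*
else
    echo "libcuda.so.1 not found in the linker cache"
fi
```

On a driver installation that ships the new back-end, the listing would be expected to include libcudadebugger files next to libcuda.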

Hi

ls -la libcuda*
lrwxrwxrwx 1 root root  12 19 oct.  00:50 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root  20 19 oct.  00:50 libcuda.so.1 -> libcuda.so.520.56.06
-rwxr-xr-x 1 root root 26M 19 oct.  00:50 libcuda.so.520.56.06

I installed the driver (and CUDA) with the package manager (pacman/yay): https://archlinux.org/packages/extra/x86_64/nvidia/

Thanks

OK, it looks like the installer is missing the newly added libcudadebugger.so.1 file. What you can do right now (until the Arch GPU driver package is updated):

  • Try running the debugger with the environment variable set via export CUDBG_USE_LEGACY_DEBUGGER=1. This should use the debugger back-end in libcuda.so.
  • Install the GPU driver via the official NVIDIA installer: Linux x64 (AMD64/EM64T) Display Driver | 520.56.06 | Linux 64-bit | NVIDIA
  • Obtain the .run installer from the link above, unpack it, and copy the libcudadebugger* libraries to a dedicated directory. Add this directory to LD_LIBRARY_PATH when running the debugger session (dlopen("libcudadebugger.so.1") should then work).
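
The third option could be scripted roughly as follows. This is only a sketch: the function name stage_debugger_libs, the destination directory, and the application name are placeholders, and it assumes the installer's -x option unpacks into a directory named after the .run file.

```shell
# Copy the versioned libcudadebugger library from an unpacked driver tree
# into a private directory and create the libcudadebugger.so.1 name there.
# Returns non-zero if no matching library was found.
stage_debugger_libs() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    found=1
    for lib in "$src_dir"/libcudadebugger.so.*.*; do
        [ -f "$lib" ] || continue          # glob did not match anything
        cp "$lib" "$dest_dir"/ || return 1
        ln -sf "$(basename "$lib")" "$dest_dir/libcudadebugger.so.1"
        found=0
    done
    return "$found"
}

# Intended use (left commented out because it needs the real installer):
#   sh ./NVIDIA-Linux-x86_64-520.56.06.run -x
#   stage_debugger_libs ./NVIDIA-Linux-x86_64-520.56.06 "$HOME/cudadebugger"
#   LD_LIBRARY_PATH="$HOME/cudadebugger${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" cuda-gdb ./my_app
```

Setting LD_LIBRARY_PATH only on the cuda-gdb command line keeps the extra directory out of every other process, which may be preferable to exporting it globally.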

Hi,
Thanks for your answer.

  1. export CUDBG_USE_LEGACY_DEBUGGER=1
    I get fatal: The CUDA driver initialization failed. (error code = CUDBG_ERROR_INITIALIZATION_FAILURE(0x14)

./NVIDIA-Linux-x86_64-520.56.06.run -x
ls -l /usr/local/lib/libcudadeb*
lrwxrwxrwx 1 root    root     28 27 nov.  17:32 /usr/local/lib/libcudadebugger.so -> libcudadebugger.so.520.56.06
lrwxrwxrwx 1 root    root     28 27 nov.  17:32 /usr/local/lib/libcudadebugger.so.1 -> libcudadebugger.so.520.56.06
-rwxr-xr-x 1 root root 11M 30 sept. 18:07 /usr/local/lib/libcudadebugger.so.520.56.06
echo $LD_LIBRARY_PATH
.... :/usr/local/lib: ....
export CUDBG_USE_LEGACY_DEBUGGER=0

I (strangely) get: fatal: No CUDA capable device was found. (error code = CUDBG_ERROR_NO_DEVICE_AVAILABLE(0x27)
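
When LD_LIBRARY_PATH has many entries, as in the listing above, it can help to confirm which one actually provides libcudadebugger.so.1. A small sketch, assuming a POSIX shell; the function name first_dir_with is made up, and it only mirrors the LD_LIBRARY_PATH part of the lookup, not the linker cache or default directories:

```shell
# Print the first directory in a colon-separated search path that
# contains the named file. Empty entries are skipped in this sketch.
first_dir_with() {
    lib=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

first_dir_with libcudadebugger.so.1 "${LD_LIBRARY_PATH:-}" \
    || echo "libcudadebugger.so.1 not reachable through LD_LIBRARY_PATH"
```

If the library is found where expected, a "no device" error more likely points at the driver or GPU state than at the library lookup.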

I rebooted. For some reason the driver or the GPU sometimes gets bloated, and I have to reboot. The second solution works perfectly.

Thank you.

Hi @benjamin.rouxel1
Thank you for confirming that it worked for you!