I’ve been writing CUDA code for a while and have not used cuda-gdb recently. Sysadmin upgraded CUDA to 9.2 and I can’t seem to get cuda-gdb to work. I have multiple codes that work correctly, but no matter what, each time I run the code inside cuda-gdb I get the error message
Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c)
This is on a machine that has two Tesla K20c cards. As for details, I am running it on the command line from a bash shell on an Ubuntu Linux system. The error occurs as soon as I try to step into any kernel, no matter how simple. Any hints on how to resolve this would be appreciated.
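For reference, the failure reproduces with even a trivial kernel. A hypothetical minimal example (file and kernel names are mine), built with device debug info via `nvcc -g -G`:

```cuda
// minimal_repro.cu -- hypothetical minimal example; on the affected setups the
// CUDBG_ERROR_COMMUNICATION_FAILURE error appears as soon as cuda-gdb tries
// to suspend the device to step into the kernel.
// Build: nvcc -g -G minimal_repro.cu -o minimal_repro
#include <cstdio>

__global__ void trivial(int *out)
{
    // Setting a breakpoint here and stepping in is enough to trigger the error.
    out[threadIdx.x] = (int)threadIdx.x;
}

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(int));
    trivial<<<1, 32>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Inside cuda-gdb, `break trivial` followed by `run` is enough to hit it.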
Have you solved this problem? I have the same issue. My hardware is a Tesla K20Xm. My OS is Debian sid. I tried CUDA 9.0, 9.2 and 10.0, and all of them give the same result. My program runs fine, but cuda-gdb just gives me the following error:
[New Thread 0x7fffec34e700 (LWP 18581)]
[New Thread 0x7fffebb4d700 (LWP 18582)]
Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).
Sorry, no luck so far. I left it in the hands of my sysadmin and have moved on for now. He tried a few things but decided to wait and see if anyone replied to this request. As you can see, nothing has happened. May need to ping the NVIDIA people. If you do find a solution, please post.
I am having the same issue. I am using Debian 9 with a Quadro K620 (device 1) for display and a Titan Xp (device 0).
The error message I get from cuda-gdb is
“Error: Failed to suspend device for CUDA device 1, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).”
When I switch to the text-based console/terminal (pressing Ctrl+Alt+F1), cuda-gdb works fine without the error.
I guess the error occurs because it cannot “suspend” the GPU being used for display. Or maybe I have too many monitors (3) connected to the GPU.
In fact, the GPU machine I use is a local cluster node that I connect to via ssh, so I am already using a terminal/console. My problem may be something else. Thank you all the same.
No news. I followed the exact official steps to install CUDA. Is it possible that there is a hardware compatibility issue with the latest CUDA? For now I am using “printf” to debug.
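As a fallback while the debugger is broken, device-side printf gives at least some visibility into kernel state. A minimal sketch, with names of my own choosing:

```cuda
// printf_debug.cu -- sketch of printf-style kernel debugging (hypothetical names).
// Device-side printf output is buffered on the GPU and flushed at
// synchronization points, so a cudaDeviceSynchronize() after the launch
// is needed before the output appears on the host.
#include <cstdio>

__global__ void debug_kernel(const int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] < 0)
        printf("block %d thread %d: unexpected value %d at index %d\n",
               blockIdx.x, threadIdx.x, data[i], i);
}

int main()
{
    const int n = 256;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = (i == 100) ? -1 : i;

    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    debug_kernel<<<(n + 127) / 128, 128>>>(d, n);
    cudaDeviceSynchronize();   // flushes device-side printf output
    cudaFree(d);
    return 0;
}
```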
Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).
I am experiencing the exact same issue on Ubuntu 18.04.3, both with CUDA 9 (the one that comes as an Ubuntu package) and CUDA 10.1 installed via a .run installer.
I have both a GeForce GTX 660 Ti and a Tesla K20Xm installed. The problem occurs both when using the Tesla and when running in console mode (without X blocking the card) on the GeForce.
This one really drives me nuts. I am desperately looking for a way to debug my code.
In a multi-GPU setup, try using the CUDA_VISIBLE_DEVICES environment variable to restrict the CUDA runtime's visibility to the GPU that is actually being used for debugging:
CUDA_VISIBLE_DEVICES="1" cuda-gdb …
(for example; it may be necessary to specify "0" or some other number)
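One way to check which index maps to which card before picking a value (note that nvidia-smi's ordering does not always match the CUDA runtime's default fastest-first ordering; setting CUDA_DEVICE_ORDER=PCI_BUS_ID makes them agree); `./my_app` is a placeholder for your binary:

```shell
# List the GPUs as the driver enumerates them
nvidia-smi -L

# Make the CUDA runtime enumerate devices in PCI bus order (matching
# nvidia-smi), then expose only device 1 to the app under the debugger.
# "./my_app" is a placeholder for your binary.
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 cuda-gdb ./my_app
```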
Update to the latest CUDA 10.1U2, i.e. cuda 10.1.243 (or whatever is later than that)
Reboot, either into recovery mode (to prevent a display manager from being started) or shut down the running display manager after a normal boot:
sudo systemctl stop sddm # kubuntu
sudo systemctl stop gdm # ubuntu
Then switch to a text console, e.g. by hitting Ctrl-Alt-F2, and log in
Run the nvidia run-file installer for the nvidia graphics driver
Reboot and make sure the nvidia graphics driver is loaded (e.g. by using lsmod). If the nouveau driver is still loaded, try to blacklist it manually, or run the nvidia driver installer again, which will hopefully blacklist it for you.
Reboot again (and keep trying to blacklist nouveau until the nvidia driver 435.xx is actually loaded).
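The steps above roughly correspond to this sequence (a sketch assuming an Ubuntu/gdm system; the installer file name is a placeholder for whichever driver version you download):

```shell
# Stop the display manager so no X server holds the GPU
sudo systemctl stop gdm          # sddm on Kubuntu

# Run the NVIDIA run-file installer (placeholder file name)
sudo sh ./NVIDIA-Linux-x86_64-435.21.run

# After rebooting, check that the nvidia module is loaded and nouveau is not
lsmod | grep -E '^(nvidia|nouveau)'

# If nouveau is still loaded, blacklist it manually and rebuild the initramfs
echo 'blacklist nouveau' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
sudo reboot
```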
Just for the record: the error I reported occurred even when debugging without an X server running, and even when I removed the second NVIDIA device from my machine (both devices are cuda capable). So it was unrelated to device visibility (CUDA_VISIBLE_DEVICES), which I had also tried switching between the two available cuda devices.
Unfortunately it turns out that not all of my problems are solved. Debugging with cuda-gdb now works on the GeForce GTX 660 Ti but not on the Tesla K20Xm, which hangs during any CUDA API call.