cuda-gdb error

I’ve been writing CUDA code for a while but have not used cuda-gdb recently. Our sysadmin upgraded CUDA to 9.2, and now I can’t seem to get cuda-gdb to work. I have multiple codes that run correctly, but no matter what, each time I run one of them inside cuda-gdb I get the error message

Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c)

This is on a machine that has two Tesla K20c cards. As for details, I am running it on the command line from a bash shell on an Ubuntu Linux system. The error occurs as soon as I try to step into any kernel, no matter how simple. Any hints on how to resolve this would be appreciated.
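
For reference, here is a stripped-down sketch of the kind of program and session that fails for me (the file name simple.cu and the kernel addOne are just placeholders, not my actual code):

// simple.cu -- placeholder kernel, just complex enough to step into
#include <cuda_runtime.h>

__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main(void)
{
    const int n = 256;
    int *d_data = 0;
    cudaMalloc((void**)&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    addOne<<<1, n>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}

I compile with device debug info (nvcc -g -G simple.cu -o simple), start cuda-gdb ./simple, set a breakpoint with break addOne and type run; the error above shows up as soon as execution reaches the kernel.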

Dave

Have you solved this problem? I have the same issue. My hardware is a Tesla K20Xm. My OS is Debian sid. I tried CUDA 9.0, 9.2 and 10.0, and all of them give the same result. My program runs fine, but cuda-gdb just gives me the following error:
[New Thread 0x7fffec34e700 (LWP 18581)]
[New Thread 0x7fffebb4d700 (LWP 18582)]
Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).

Sorry, no luck so far. I left it in the hands of my sysadmin and have moved on for now. He tried a few things but decided to wait and see if anyone replied to this request. As you can see, nothing has happened. May need to ping the NVIDIA people. If you do find a solution, please post.

I am having the same issue. I am using Debian 9 with a Quadro K620 (device 1) for display and a Titan Xp (device 0).
The error message I get from cuda-gdb is
“Error: Failed to suspend device for CUDA device 1, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).”

When I switch to the text-based console/terminal (pressing Ctrl+Alt+F1), cuda-gdb works fine without the error.

I guess the error occurs because it cannot “suspend” the GPU being used for display. Or maybe I have too many monitors (3) connected to the GPU.

In fact, the GPU machine I use is a local cluster node. I connect to it over ssh, so I am already working from a terminal/console. My problem may be something else. Thank you all the same.

Hello,

Any news on this topic? I have the same issue on Oracle Linux 7.6 (Red Hat) with CUDA 10 and a Quadro 410 + GT 720, even with this simple code:

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    int nDevices;

    /* Even this single runtime call triggers the error under cuda-gdb. */
    cudaGetDeviceCount(&nDevices);

    return 0;
}

KR,
Iggi

Still no answers.

No news. I followed the official installation steps for CUDA exactly. Is it possible that the hardware is not compatible with the latest CUDA? For now I am using printf to debug.
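
In case it helps anyone stuck at the same point, this is roughly what my printf workaround looks like (only a sketch; the kernel name and the values are made up):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, only meant to show device-side printf as a fallback.
__global__ void scaleKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * in[i];
        if (i == 0)  // print from a single thread to keep the output readable
            printf("thread %d: in=%f out=%f\n", i, in[i], out[i]);
    }
}

int main(void)
{
    const int n = 32;
    float h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = 0, *d_out = 0;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleKernel<<<1, n>>>(d_in, d_out, n);
    cudaDeviceSynchronize();   // device printf output is flushed here

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Device-side printf output is only flushed at synchronization points such as cudaDeviceSynchronize(), so the kernel launch has to be followed by one.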

Still nobody? I have the same issue with CUDA 10:

Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).

The error as documented for the CUDA Debugger API:
CUDBG_ERROR_COMMUNICATION_FAILURE - Communication error between the debugger and the application.

The graphics card is not used for desktop output:

glxinfo | egrep "OpenGL vendor|OpenGL renderer"
OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) Ivybridge Desktop

Unfortunately I am on Ubuntu 18 and I can’t downgrade CUDA to an older version where everything worked fine.

Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_COMMUNICATION_FAILURE(0x1c).

I am experiencing exactly the same on Ubuntu 18.04.3, both with CUDA 9 (the one that comes as an Ubuntu package) and with CUDA 10.1 installed via a .run installer.

I have both a GeForce GTX 660 Ti and a Tesla K20Xm installed. The problem occurs both when using the Tesla and when running in console mode (without X blocking the card) on the GeForce.

This one really drives me nuts. I am desperately looking for a way to debug my code.

A few suggestions.

  1. In a multi-GPU setup, try using the CUDA_VISIBLE_DEVICES environment variable to restrict CUDA runtime visibility to the GPU that is actually being used for debugging (see the short check program after this list):

CUDA_VISIBLE_DEVICES="1" cuda-gdb …

(for example; it may be necessary to specify "0" or some other number)

  2. Update to the latest CUDA 10.1U2, i.e. CUDA 10.1.243 (or whatever is later than that).

  3. Update the driver to the latest available for your GPU, preferably 435.xx or later: https://www.nvidia.com/download/driverResults.aspx/149785/en-us

  4. Don’t debug on a GPU that is being used for display of any kind.
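
Regarding suggestion 1, a quick way to confirm what the runtime actually sees is a small enumeration program along these lines (just a sketch; check.cu is a made-up name):

// check.cu -- print every device the CUDA runtime can see
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // i is the runtime's index, i.e. after CUDA_VISIBLE_DEVICES remapping
        printf("device %d: %s\n", i, prop.name);
    }
    return 0;
}

Running it as CUDA_VISIBLE_DEVICES="1" ./check shows which physical GPU ends up as device 0 inside the process.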

Thanks a lot Robert, installing the latest driver (runfile) and CUDA did the trick for me.

For everybody using Ubuntu 18.04.3, here’s what I did:

  1. Remove all nvidia-related Ubuntu packages (assuming you had not used run file installers):
    sudo apt purge "nvidia*"

  2. Remove all cuda-related packages (again assuming you had not used run file installers):
    sudo apt purge "cuda*"

  3. Download the driver (beta) suggested by Robert from https://www.nvidia.com/download/driverResults.aspx/149785/en-us

  4. Reboot either into recovery mode (to prevent a display manager from being started) or shut down the running display manager after a normal boot:
    sudo systemctl stop sddm # kubuntu
    sudo systemctl stop gdm # ubuntu

Then change to a text console, e.g. by hitting Ctrl-Alt-F2, and log in.

  5. Run the nvidia run-file installer for the nvidia graphics driver.

  6. Reboot and make sure the nvidia graphics driver is loaded (e.g. by using lsmod). If the nouveau driver is still loaded, try to blacklist it manually or run the nvidia driver installer again, which hopefully will blacklist it for you.

  7. Reboot again (and keep trying to blacklist nouveau) until the nvidia driver 435.xx is actually loaded.

  8. Download CUDA 10.1U2 from the CUDA Toolkit downloads page on NVIDIA Developer and run the downloaded cuda runfile installer.

  8.1) Not sure if a reboot is necessary after cuda installation.

  9. Check if nvidia-smi works (which will not be the case, e.g. if the nouveau driver is still loaded).

  10. Hopefully enjoy debugging with cuda-gdb.

Just for the record: the error I reported occurred even when debugging without an X server running, and even when I removed the second NVIDIA device from my machine (both devices are cuda capable). So it was unrelated to device visibility (CUDA_VISIBLE_DEVICES), which I had also tried switching between the two available cuda devices.

Unfortunately it turns out that not all of my problems are solved. Now debugging with cuda-gdb works on the GeForce GTX 660 Ti, but not on the Tesla K20Xm, which hangs during any CUDA API call.

Does the 435.xx beta driver support Tesla devices at all? At least they are not listed at https://www.nvidia.com/download/driverResults.aspx/149785/en-us

To be more precise, any CUDA program hangs when run on the Tesla card, whether it is launched in the debugger or directly.

After a restart, the Tesla appears to work normally again. Seems to be some kind of hiccup. Will observe this more thoroughly.