CUDA error 30 (unknown error) after suspend

Every CUDA call returns error code 30 (unknown error). This starts after a suspend/wake cycle and goes away after a reboot. I tried driver versions 375.66 and 384.47, but the problem persists with both.

$ uname -a
Linux WS1005 4.8.0-58-generic #63~16.04.1-Ubuntu SMP Mon Jun 26 18:08:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

| NVIDIA-SMI 384.47                 Driver Version: 384.47                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
| 28%   31C    P8     8W / 151W |    530MiB /  8113MiB |      0%      Default |
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|    0      3424    G   /usr/lib/xorg/Xorg                             354MiB |
|    0      3987    G   compiz                                          93MiB |
|    0      4777    G   ...el-token=1C9EB7F783F4F988F1752CC22A98C44A    79MiB |

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL

Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import caffe
>>> caffe.set_device(0)
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0719 11:13:07.302003  5523 common.cpp:151] Check failed: error == cudaSuccess (30 vs. 0)  unknown error
*** Check failure stack trace: ***

I should also mention that the problem does not reproduce on every suspend/wake cycle, but if you keep repeating the cycle it shows up eventually.
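Since the failure is intermittent, a quick probe after each wake saves launching deviceQuery or caffe every time. This is only a rough sketch: it assumes libcudart can be located through the system library search path, and the helper name cuda_device_count is made up for illustration. It calls the same cudaGetDeviceCount that deviceQuery reports on above.

```python
import ctypes
import ctypes.util


def cuda_device_count(lib=None):
    """Call cudaGetDeviceCount and return (error_code, device_count).

    Returns None when no CUDA runtime library can be located.
    `lib` may be passed in explicitly (assumption: a ctypes.CDLL for libcudart).
    """
    if lib is None:
        # Assumption: libcudart is visible to the dynamic linker (e.g. via ldconfig).
        name = ctypes.util.find_library("cudart")
        if name is None:
            return None
        lib = ctypes.CDLL(name)
    count = ctypes.c_int(0)
    # cudaGetDeviceCount(int*) returns a cudaError_t; 0 means cudaSuccess.
    err = lib.cudaGetDeviceCount(ctypes.pointer(count))
    return err, count.value
```

Compare the first element against 0 (cudaSuccess) rather than against 30, since any nonzero value means the runtime is not usable. In the broken state shown above the call should come back nonzero with a count of 0.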

I noticed suspicious lines in dmesg right after the wake, at the point where CUDA starts to fail:

This is still an issue.

The latest Linux driver:

mentions a fix for a suspend issue. It may be worth a try.

Hi txbob,

updating to this driver version didn’t help.

Do you really need to reboot, or is it enough to reload the driver? Reloading is enough for me with a GTX 1050 under Ubuntu 16.04.

Hi tera,

could you explain how you reload the driver?

1. Stop all programs using the driver (particularly X11).
2. Run "lsmod | grep nvidia" and rmmod the modules with a zero use count.
3. Repeat until no NVIDIA kernel modules are loaded.
4. Run "modprobe nvidia" to reload the driver.
5. Restart X11 or whatever was using the GPU.

Once you know the order in which to unload modules, you can also package the whole process into a script.
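As a rough illustration of such a script, here is a sketch in Python of the unload/reload loop described above. The function names (idle_nvidia_modules, reload_driver) are made up for this example, and it assumes the usual three-column lsmod layout (module, size, use count). It must run as root, and only after everything using the GPU has been stopped.

```python
import subprocess


def idle_nvidia_modules(lsmod_text):
    """Split the NVIDIA modules in `lsmod` output into (idle, busy) name lists."""
    idle, busy = [], []
    for line in lsmod_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith("nvidia"):
            continue
        # The third column is the use count; "0" means the module can be removed.
        (idle if fields[2] == "0" else busy).append(fields[0])
    return idle, busy


def _read_lsmod():
    return subprocess.check_output(["lsmod"], universal_newlines=True)


def reload_driver(run=subprocess.check_call, read_lsmod=_read_lsmod):
    """rmmod idle NVIDIA modules until none are loaded, then modprobe nvidia."""
    while True:
        idle, busy = idle_nvidia_modules(read_lsmod())
        if not idle and not busy:
            break  # no NVIDIA kernel modules left
        if not idle:
            # Something (X11, a CUDA process) still holds the driver.
            raise RuntimeError("NVIDIA modules still in use: %s" % ", ".join(busy))
        for name in idle:
            run(["rmmod", name])
    run(["modprobe", "nvidia"])
```

Calling reload_driver() with no arguments runs the real lsmod, rmmod and modprobe commands. The run and read_lsmod parameters exist only so the loop can be tried against canned lsmod output before letting it touch the real modules.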

I would like to avoid restarting X11, because it would close all my window applications, including terminals…

I am running X on the integrated Intel GPU, so it doesn’t need to be restarted. I am reserving the discrete GPU entirely for CUDA. Your priorities may be different, of course.

Still an issue with Driver Version: 418.67. Any status updates on this bug? Restarting X11 is just as painful as rebooting the whole machine.

Hi nocnokneo,

If you have an integrated Intel GPU and X11 runs on it, you can try the solution that tera recommended.