CUDA (375.66) is failing with unknown error 30 after suspending Ubuntu 16.04

Hi,

I have a GTX 1080 Ti on Ubuntu 16.04

Driver: 375.66

CUDA 8.0-61.1
libcuda 375.66
cuda-drivers 375.51-1

Problem:

After suspending the PC, the CUDA runtime stops working:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available  (error: Unable to get the number of gpus available: unknown error)

However, at the same time nvidia-smi works absolutely fine:

cypreess@gtx:~/dev/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release$ nvidia-smi 
Fri Jun  9 20:49:59 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:01:00.0      On |                  N/A |
| 23%   37C    P8    18W / 250W |    727MiB / 11169MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1261    G   /usr/lib/xorg/Xorg                              25MiB |
|    0      1463    G   /usr/lib/xorg/Xorg                             329MiB |
|    0      1801    G   /usr/bin/gnome-shell                           182MiB |
|    0      2526    G   /proc/self/exe                                  27MiB |
|    0      5243    G   ...el-token=682029E6D17C8080D4B5A7BE0DA20F10   130MiB |
+-----------------------------------------------------------------------------+
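
For reference, error 30 from cudaGetDeviceCount is just cudaErrorUnknown, so it does not say much by itself. A few quick checks after a resume (standard commands, nothing specific to this setup) help narrow down whether it is the kernel driver or the nvidia_uvm module that breaks:

lsmod | grep nvidia           # which NVIDIA kernel modules are loaded (nvidia, nvidia_uvm, nvidia_drm, ...)
ls -l /dev/nvidia*            # device nodes present? /dev/nvidia-uvm in particular
dmesg | grep -iE 'nvrm|xid'   # NVRM / Xid messages logged around the suspend/resume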

I tried changing the persistence and compute modes using:

/usr/bin/nvidia-smi -pm ENABLED
/usr/bin/nvidia-smi -c EXCLUSIVE_PROCESS

without any luck.

Only a PC restart fixes the problem.

nvidia-bug-report.log.gz (168 KB)

Update:

I just found a workaround that avoids a PC restart:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
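
To avoid running this by hand after every resume, it can be hooked into the suspend/resume cycle. This is only a sketch assuming a systemd-based suspend (the default on Ubuntu 16.04): executable scripts in /lib/systemd/system-sleep/ are called with "pre"/"post" and the sleep type as arguments, so a hypothetical /lib/systemd/system-sleep/reload-nvidia-uvm like the one below (made executable with chmod +x) would reload nvidia_uvm on every wake:

#!/bin/sh
# Hypothetical systemd sleep hook: /lib/systemd/system-sleep/reload-nvidia-uvm
# $1 is "pre" or "post"; $2 is "suspend", "hibernate" or "hybrid-sleep"
case "$1" in
    post)
        # Assumption: no CUDA process is still holding nvidia_uvm, otherwise rmmod fails
        /sbin/rmmod nvidia_uvm
        /sbin/modprobe nvidia_uvm
        ;;
esac

(If a CUDA process survives the suspend, rmmod will refuse to unload the module, so the hook only helps when nothing is using the GPU for compute at that moment.)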

Can NVIDIA please provide a response to this problem?

I can confirm that it also happens on a Quadro M2000M.

It is still happening on the 384.69 drivers, and it seems quite ridiculous to have to unload/reload modules after every suspend/resume cycle in 2017.

Thanks!

I would like to add more information about this problem:

  • Enabling persistence mode via nvidia-persistenced actually makes the box lock up on wake: no pings, nothing moves, total freeze.
  • With persistence mode disabled, dmesg shows the following on wake:

[ 319.338880] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000030, engmask 00000101, intr 10000000
[ 319.342072] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000003, engmask 00000104, intr 10000000

Is there any way to fix this problem?