nvidia driver conflict CUDA_ERROR_NO_DEVICE

Ubuntu 16.04
Cuda 9.1

I had tensorflow-gpu working, but today I ran apt-get update and apt-get upgrade.

I believe it half-way updated my nvidia driver to 390.30 and this is not allowing cuda to find the gpu device.

running the command
cat /proc/driver/nvidia/version

produces 390.12 as the driver

NVRM version: NVIDIA UNIX x86_64 Kernel Module 390.12 Wed Dec 20 07:19:16 PST 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.6)

However, if I check in Ubuntu's Software & Updates under Additional Drivers, it says 390.30 from nvidia-390 (proprietary).

The error I am getting in tensorflow indicates this configuration mismatch. Here is the error:

sess = tf.Session()
2018-02-13 16:16:49.861468: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-13 16:16:49.865479: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-02-13 16:16:49.865571: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: teves
2018-02-13 16:16:49.865599: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: teves
2018-02-13 16:16:49.865685: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 390.30.0
2018-02-13 16:16:49.865740: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 390.12 Wed Dec 20 07:19:16 PST 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.6)
"""
2018-02-13 16:16:49.865788: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 390.12.0
2018-02-13 16:16:49.865811: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 390.12.0 does not match DSO version 390.30.0 -- cannot find working devices in this configuration

How can I fix this?
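
For anyone hitting the same error, the mismatch can be confirmed without TensorFlow. Below is a minimal Python sketch, not an official tool: it pulls the kernel module version out of the contents of /proc/driver/nvidia/version and compares it with a libcuda version string. The function names are made up for illustration, and the 390.30.0 value is simply copied from the log above.

```python
import re
from pathlib import Path


def kernel_module_version(text):
    """Extract the version from /proc/driver/nvidia/version contents, or None."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


def versions_match(kernel_version, libcuda_version):
    """Compare component-wise, ignoring trailing zeros (390.30 equals 390.30.0)."""
    def normalize(version):
        parts = [int(p) for p in version.split(".")]
        while parts and parts[-1] == 0:
            parts.pop()
        return parts

    return normalize(kernel_version) == normalize(libcuda_version)


if __name__ == "__main__":
    proc_file = Path("/proc/driver/nvidia/version")
    if proc_file.exists():
        kernel = kernel_module_version(proc_file.read_text())
        print("kernel module:", kernel)
        if kernel:
            # Assumption: 390.30.0 is the libcuda version from the log above;
            # substitute whatever your own log reports.
            print("matches libcuda 390.30.0:", versions_match(kernel, "390.30.0"))
```

If the two versions differ, the user-space library was upgraded while the old kernel module is still loaded, which is the situation the log describes.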

Either downgrade libcuda to 390.12 or install nvidia-390.30 from the CUDA repository:
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/

Hi generix,

Thank you for the reply. I purged the drivers and then installed nvidia-390.30. I am now getting a memory allocation error because something is not releasing the memory after a tf session is started. I have also tried downgrading to CUDA 9.0, but the problem is the same.

If I start a python terminal then:

import tensorflow as tf
sess = tf.Session()

The first time I get freeMemory: 7.50GiB
then I run:

sess.close()

I exit python and then start another python session in the same terminal.

I run the commands again and get:
freeMemory: 279.44MiB

And the 3rd time I run the commands I get freeMemory: 122.50MiB and failed to allocate 72.50M from device CUDA_ERROR_OUT_OF_MEMORY

Subsequently I am never able to get the memory back unless I restart my system.

Do you have any ideas on this? I am becoming a bit desperate.

Use nvidia-smi -q to see what’s hogging the mem. Might be a memory management bug in the driver, revert to libcuda-390.12 and nvidia-390.12 and check.

Hi generix,

It says it is being used by python but python is closed.

I will try to revert to the old driver using the link you posted before. I am a bit nervous because when I have installed drivers from nvidia directly I have ended up in a login loop. I usually have to install from the Ubuntu repository. I will give it a try though.

Here is the output of the nvidia-smi -q command:

teves@teves:~$ nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Thu Feb 15 12:17:52 2018
Driver Version : 390.30

Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : GeForce GTX 1070
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-587f184e-5b95-9c68-74eb-c6d316a2941b
Minor Number : 0
VBIOS Version : 86.04.5E.00.1A
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1BE110DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x07C01028
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 1000 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 8111 MiB
Used : 8040 MiB
Free : 71 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 14 MiB
Free : 242 MiB
Compute Mode : Default
Utilization
Gpu : 4 %
Memory : 3 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 52 C
GPU Shutdown Temp : 99 C
GPU Slowdown Temp : 94 C
GPU Max Operating Temp : 91 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : N/A
Power Draw : 11.83 W
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 4004 MHz
Video : 1708 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1151
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 116 MiB
Process ID : 1769
Type : G
Name : compiz
Used GPU Memory : 37 MiB
Process ID : 2671
Type : G
Name : /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=15529036578042773042,10292409604259125779,131072 --gpu-preferences=GAAAAAAAAAAAAQAAAQAAAAAAAAAAAGAA --gpu-vendor-id=0x10de --gpu-device-id=0x1be1 --gpu-driver-vendor=Nvidia --gpu-driver-version=390.30 --gpu-driver-date --service-request-channel-token=092EE45ACADD6787D3E749B3FCBD1F66
Used GPU Memory : 101 MiB
Process ID : 23034
Type : C
Name : python
Used GPU Memory : 7391 MiB
Process ID : 23092
Type : C
Name : python
Used GPU Memory : 149 MiB
Process ID : 23153
Type : C
Name : python
Used GPU Memory : 209 MiB
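
For readers who want to pull the same information out of this output programmatically, here is a rough Python sketch that reads the Processes entries from `nvidia-smi -q` text and lists the compute (type C) processes still holding memory. The function names are hypothetical, and the parsing only assumes the "key : value" layout seen in the output above; other driver versions may format it differently.

```python
import re


def parse_processes(smi_text):
    """Collect the Process ID / Type / Name / Used GPU Memory entries."""
    processes = []
    current = None
    for raw in smi_text.splitlines():
        line = raw.strip()
        if line.startswith("GPU "):
            # A new GPU block starts; stop attaching fields to the last process.
            current = None
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Process ID":
            current = {"pid": int(value)}
            processes.append(current)
        elif current is not None:
            if key == "Type":
                current["type"] = value
            elif key == "Name":
                current["name"] = value
            elif key == "Used GPU Memory":
                match = re.match(r"(\d+)\s*MiB", value)
                current["used_mib"] = int(match.group(1)) if match else None
    return processes


def compute_processes(processes):
    """Keep only compute contexts (Type C), e.g. CUDA users such as python."""
    return [p for p in processes if p.get("type") == "C"]
```

Applied to the output above, this would report the three python PIDs (23034, 23092, 23153), which together account for nearly all of the used framebuffer memory.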

Hi generix,

Can you give me instructions on how to rollback the driver to 390.12? I see there are several files with 390.12 in the repos. I just want to be sure on this point. Thanks

I was able to rollback the driver using the run file located here:

http://www.nvidia.com/download/driverResults.aspx/128743/en-us

It seems to have helped with the problem. If I use the same terminal, open and close python, and start tf.Session three separate times, I get the same error. However, if I close the terminal and then start a tf.Session, the memory is available again. This is progress.

Is this the expected behavior?

Also how do I set this driver to not be updated?

The .run installer is not the best option to do this. Unless you have used the DKMS option you will have to reinstall the kernel driver (-K) on kernel updates.
You can also just use the Ubuntu driver from the graphics ppa, should pull in dependencies by itself.

The vmem getting hogged is not the ‘expected’ behavior, seems some driver bug


last comment in there.

Unfortunately I was not given a DKMS option.

I guess I will uninstall and reinstall through the ppa to make sure I don't have issues in the future.

Just to confirm

Would you expect the following commands to work best?

sudo add-apt-repository ppa:graphics-drivers
sudo apt-get update
sudo apt-get install nvidia-390.12

You could try the 384 drivers, maybe the bug is not apparent there.
sudo apt install nvidia-graphics-drivers-384

The 390.12 driver was a beta and seems to have been phased out of the ppa.
nvidia-graphics-drivers-390
would bring you to 390.25.
There you could use the workaround from the github issue mentioned above:
nvidia-smi -q | grep -2 python
gives you the pid and mem consumption of the python processes, then
kill -9 <pid>
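
The same workaround could be scripted roughly as follows. This is only a sketch under assumptions: instead of grepping `nvidia-smi -q`, it uses nvidia-smi's CSV query mode (`--query-compute-apps`), and the helper names are invented for illustration. It defaults to a dry run so nothing is killed by accident.

```python
import os
import signal
import subprocess

# Assumption: nvidia-smi is on PATH and supports the CSV compute-apps query.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Turn 'pid, name, MiB' lines into (pid, name, used_mib) tuples."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        # The name sits between the first and last field and may contain commas.
        name = ",".join(parts[1:-1])
        used = int(parts[-1]) if parts[-1].isdigit() else None
        rows.append((int(parts[0]), name, used))
    return rows


def stale_python_pids(rows, own_pid=None):
    """PIDs of python processes holding GPU memory, excluding this process."""
    own_pid = os.getpid() if own_pid is None else own_pid
    return [pid for pid, name, _ in rows
            if "python" in os.path.basename(name) and pid != own_pid]


def kill_stale(dry_run=True):
    """Equivalent of 'kill -9 <pid>' for each leftover python process."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    for pid in stale_python_pids(parse_compute_apps(out)):
        print("would kill" if dry_run else "killing", pid)
        if not dry_run:
            os.kill(pid, signal.SIGKILL)
```

Calling kill_stale() first and checking the printed PIDs before calling kill_stale(dry_run=False) avoids killing a python job that is still doing real work.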

I had the same issue. I simply rebooted my desktop and ran the command again, and surprisingly it worked!