Unable to load the 'nvidia-drm' kernel module on Ubuntu 18.04

I tried solutions to this problem from different threads in this forum, but have had no success. Purging old kernels, purging all NVIDIA installations, etc.: none of it has worked so far. I have attached the bug report below. I would really appreciate help in fixing this issue.

nvidia-bug-report.log.gz (77.4 KB)

It looks like you ran the .run installer on top of the Ubuntu-provided driver.

Thanks for the quick response. I tried the following:
- sudo ./NVIDIA-Linux-x86_64-410.78.run --uninstall
- sudo add-apt-repository ppa:graphics-drivers/ppa
- sudo apt-get update
- sudo apt install nvidia-driver-418
- I rebooted the machine
- nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
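
When nvidia-smi cannot talk to the driver, a useful first check is whether the NVIDIA kernel modules are loaded at all. The sketch below parses text in the format of /proc/modules (the same data lsmod prints). The function name and the list of expected modules are assumptions made for illustration, not something taken from the bug report.

```python
from pathlib import Path

# Module names as they appear in /proc/modules (underscores, not hyphens).
# This list is an assumption; adjust it to the modules your setup needs.
EXPECTED = ("nvidia", "nvidia_modeset", "nvidia_drm")


def missing_nvidia_modules(proc_modules_text, expected=EXPECTED):
    """Return the expected module names that are not listed as loaded."""
    # The first whitespace-separated field of each line is the module name.
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in expected if name not in loaded]


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        missing = missing_nvidia_modules(path.read_text())
        print("missing modules:", missing or "none")
    else:
        print("/proc/modules not found (not a Linux system?)")
```

If nvidia_drm is reported as missing while the driver package is installed, a blacklist entry or a leftover from another installer is a plausible cause.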

So, no success yet. Everything was working fine until I started following the instructions to install TensorFlow 2.0 on Ubuntu 18.04 (CUDA 10) from this page.

When I open Software Updates, this is what shows up (screenshots attached).

Please run

grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*

to check if a file exists containing

blacklist nvidia

If such a file exists, delete it and run

sudo update-initramfs -u

afterwards and reboot.
Also, please make sure the packages nvidia-prime and ubuntu-drivers-common are installed, then run

sudo prime-select nvidia

If it still doesn’t work, please create a new nvidia-bug-report.log and attach it.
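
The blacklist check above can also be scripted. This is a minimal sketch that scans the same two directories as the grep command and reports only active (uncommented) lines starting with "blacklist nvidia". The function name and the regular expression are illustrative assumptions. It only reads files; deleting the offending file and running update-initramfs are still manual steps.

```python
import re
from pathlib import Path

# Directories named in the grep command; passed as an argument so the
# scan can be pointed at other locations.
DEFAULT_DIRS = ("/etc/modprobe.d", "/lib/modprobe.d")

# Matches e.g. "blacklist nvidia" or "blacklist nvidia-drm".
# Commented-out lines ("# blacklist nvidia") do not match.
BLACKLIST_RE = re.compile(r"^\s*blacklist\s+nvidia\S*")


def find_nvidia_blacklists(dirs=DEFAULT_DIRS):
    """Return (file, line number, line) for every nvidia blacklist entry."""
    hits = []
    for directory in dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip directories that do not exist on this system
        for conf in sorted(root.iterdir()):
            if not conf.is_file():
                continue
            text = conf.read_text(errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if BLACKLIST_RE.match(line):
                    hits.append((str(conf), number, line.strip()))
    return hits


if __name__ == "__main__":
    for filename, number, line in find_nvidia_blacklists():
        print(f"{filename}:{number}: {line}")
```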

To install CUDA, download and add the .deb first, then use
sudo apt install cuda-toolkit-10-0
Don’t use apt install cuda; it overwrites the driver.

I found a file under /lib/modprobe.d/ that contained blacklist nvidia. I deleted it and ran the update-initramfs command and then rebooted. Now it shows me the login screen GUI. But when I enter my password, it just hangs there and nothing happens.

I have connected my monitors to the motherboard instead of the graphics card. Could that have anything to do with it?

I am trying to have my monitors use the integrated Intel graphics, while I dedicate the GPU just to my ML/deep learning experiments.

When I boot into recovery mode and type nvidia-smi at the root shell prompt, it seems to work and shows me the GPU, driver version, etc.

Follow this procedure:
https://devtalk.nvidia.com/default/topic/1043405/linux/ubuntu-18-04-headless_390-intel-igpu-after-prime-select-intel-lost-contact-to-geforce-1050ti/post/5293003/#5293003

Thanks a lot! Everything related to the drivers and CUDA works well.

However, I am getting these cuDNN errors after running ‘pip install tensorflow-gpu==2.0.0-alpha0’ and then running a sample TF 2.0 script.

2019-03-24 17:42:45.431677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1015] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-24 17:42:45.431687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0
2019-03-24 17:42:45.431691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1034] 0: N
2019-03-24 17:42:45.431940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1149] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7430 MB memory) → physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-03-24 17:42:45.987544: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-03-24 17:42:46.149791: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-03-24 17:42:46.730967: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-24 17:42:46.733721: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-24 17:42:46.733736: F tensorflow/core/kernels/conv_grad_input_ops.cc:955] Check failed: stream->parent()->GetConvolveBackwardDataAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(stream->parent()), &algorithms)
Aborted (core dumped)

I have installed both libcudnn7=7.4.1.5-1+cuda10.0 and libcudnn7-dev=7.4.1.5-1+cuda10.0

Try removing the ~/.nv directory.

Tried removing the folder and running again. I am getting the same error.

Please check if this applies:
https://github.com/tensorflow/tensorflow/issues/24496

Thanks again! The following snippet made it work!

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

# Allocate GPU memory on demand instead of reserving nearly all of it up front.
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

This solved my issue. Thank you!
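
The snippet above goes through the TF 1.x compatibility API. As a possible alternative, TensorFlow 2.x also exposes a memory-growth switch under tf.config.experimental. The sketch below assumes that API is present in the installed build (it was not tested against 2.0.0-alpha0 in this thread) and that it runs before any GPU has been initialized. The helper name enable_gpu_memory_growth is made up for illustration.

```python
def enable_gpu_memory_growth():
    """Turn on on-demand GPU memory allocation for every visible GPU.

    Returns the list of GPUs that were configured. Returns an empty list
    if TensorFlow cannot be imported or no GPU is visible.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return []

    # Assumption: tf.config.experimental is available in the installed build.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before the GPU is first used; TensorFlow raises
        # a RuntimeError otherwise.
        tf.config.experimental.set_memory_growth(gpu, True)
    return list(gpus)


if __name__ == "__main__":
    configured = enable_gpu_memory_growth()
    print(f"memory growth enabled on {len(configured)} GPU(s)")
```

Call the helper once at the top of the script, before building any model or creating any tensors on the GPU.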