Unable to load the 'nvidia-drm' kernel module on Ubuntu 18.04

ashoonya · March 24, 2019, 7:26pm

I tried solutions to this problem from several threads in this forum, including purging old kernels and purging all NVIDIA installations, but none have worked so far. I have attached the bug report below. I would really appreciate help in fixing this issue.

nvidia-bug-report.log.gz (77.4 KB)

generix · March 24, 2019, 7:54pm

Looks like you installed the .run installer over the Ubuntu-provided driver.

Don’t use the .run installers. To fix this:
- run the .run installer with --uninstall to remove it
- add the Ubuntu graphics-drivers PPA: https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa
- install the driver from that PPA (sudo apt install nvidia-driver-418)

ashoonya · March 24, 2019, 10:15pm

Thanks for the quick response. I tried the following:
- sudo ./NVIDIA-Linux-x86_64-410.78.run --uninstall
- sudo add-apt-repository ppa:graphics-drivers/ppa
- sudo apt-get update
- sudo apt install nvidia-driver-418
- rebooted the machine
- nvidia-smi, which printed:
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
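
When nvidia-smi reports that it cannot communicate with the driver, one quick check is whether any NVIDIA kernel module is loaded at all. The following is a minimal Python sketch, not something posted in this thread: it reads /proc/modules (the standard Linux list of loaded modules) and picks out names starting with "nvidia". The function names are hypothetical helpers chosen for illustration.

```python
from pathlib import Path


def loaded_nvidia_modules(proc_modules_text):
    """Return the sorted names of loaded modules that start with 'nvidia'.

    Each line of /proc/modules begins with the module name, followed by
    size, use count, dependents, state, and load address.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("nvidia"):
            names.append(fields[0])
    return sorted(names)


def read_proc_modules(path="/proc/modules"):
    """Read the module list; return an empty string if the file is absent."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


if __name__ == "__main__":
    modules = loaded_nvidia_modules(read_proc_modules())
    if modules:
        print("Loaded NVIDIA modules:", ", ".join(modules))
    else:
        print("No NVIDIA kernel modules are loaded.")
```

An empty result suggests the module was never loaded (for example because it is blacklisted or was not built for the running kernel), rather than nvidia-smi itself being broken.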

So, no success yet. Everything was working fine until I started following instructions to install TensorFlow 2.0 on Ubuntu 18.04 (CUDA 10) from this page.

When I open up Software Updates, this is what shows up (attached snapshots).

generix · March 24, 2019, 10:26pm

Please run

grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*

to check whether any file contains

blacklist nvidia

If one does, delete that file, then run

sudo update-initramfs -u

and reboot.
Also, please make sure the packages nvidia-prime and ubuntu-drivers-common are installed, then run

sudo prime-select nvidia

If it still doesn’t work, please create a new nvidia-bug-report.log and attach it.
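
The blacklist check described above can also be scripted. This is a minimal Python sketch under stated assumptions: it scans the same two directories as the grep command, but it is narrower than grep, since it reports only lines of the form "blacklist nvidia..." and skips commented-out lines. The function name find_nvidia_blacklists is a hypothetical helper, not part of any NVIDIA or Ubuntu tool.

```python
import re
from pathlib import Path

# Matches lines such as "blacklist nvidia" or "blacklist nvidia-drm".
# Lines starting with "#" do not match, so commented-out entries are ignored.
BLACKLIST_RE = re.compile(r"^\s*blacklist\s+(nvidia\S*)")


def find_nvidia_blacklists(directories):
    """Return (file path, line number, module name) for each blacklist entry."""
    hits = []
    for directory in directories:
        d = Path(directory)
        if not d.is_dir():
            continue  # skip directories that do not exist on this system
        for path in sorted(d.iterdir()):
            if not path.is_file():
                continue
            try:
                lines = path.read_text(errors="replace").splitlines()
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
            for number, line in enumerate(lines, start=1):
                match = BLACKLIST_RE.match(line)
                if match:
                    hits.append((str(path), number, match.group(1)))
    return hits


if __name__ == "__main__":
    for path, number, module in find_nvidia_blacklists(
        ["/etc/modprobe.d", "/lib/modprobe.d"]
    ):
        print(f"{path}:{number}: blacklist {module}")
```

The script only reports matches; deleting the offending file and running sudo update-initramfs -u remain manual steps.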

To install CUDA, download and add the .deb first, then use
sudo apt install cuda-toolkit-10-0
Don’t use apt install cuda; it overwrites the driver.

ashoonya · March 24, 2019, 10:49pm

I found a file under /lib/modprobe.d/ that contained blacklist nvidia. I deleted it and ran the update-initramfs command and then rebooted. Now it shows me the login screen GUI. But when I enter my password, it just hangs there and nothing happens.

I have connected my monitors to the motherboard outputs instead of the graphics card’s ports. Could that be related to this?

I am trying to have my monitors use the integrated Intel graphics, while I dedicate the GPU just for my ML/deep learning experiments.

ashoonya · March 24, 2019, 11:01pm

When I boot into recovery mode and type nvidia-smi at the root shell prompt, it seems to work and shows me the GPU, driver version, etc.

generix · March 24, 2019, 11:21pm

Follow this procedure:
https://devtalk.nvidia.com/default/topic/1043405/linux/ubuntu-18-04-headless_390-intel-igpu-after-prime-select-intel-lost-contact-to-geforce-1050ti/post/5293003/#5293003

ashoonya · March 25, 2019, 12:46am

Thanks a lot! Everything related to the drivers and CUDA works well now.

However, I am getting these cuDNN errors after running ‘pip install tensorflow-gpu==2.0.0-alpha0’ and then a sample TF 2.0 script.

2019-03-24 17:42:45.431677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1015] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-24 17:42:45.431687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0
2019-03-24 17:42:45.431691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1034] 0: N
2019-03-24 17:42:45.431940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1149] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7430 MB memory) → physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-03-24 17:42:45.987544: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-03-24 17:42:46.149791: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-03-24 17:42:46.730967: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-24 17:42:46.733721: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-24 17:42:46.733736: F tensorflow/core/kernels/conv_grad_input_ops.cc:955] Check failed: stream->parent()->GetConvolveBackwardDataAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(stream->parent()), &algorithms)
Aborted (core dumped)

I have installed both libcudnn7=7.4.1.5-1+cuda10.0 and libcudnn7-dev=7.4.1.5-1+cuda10.0

generix · March 25, 2019, 9:46am

Try removing the ~/.nv directory.

ashoonya · March 25, 2019, 3:19pm

Tried removing the folder and running again. I am getting the same error.

generix · March 25, 2019, 3:38pm

Please check if this applies:
https://github.com/tensorflow/tensorflow/issues/24496

ashoonya · March 25, 2019, 3:51pm

Thanks again! The following snippet made it work!

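from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

# Allocate GPU memory on demand instead of reserving it all at startup.
config = ConfigProto()
config.gpu_options.allow_growth = True
# Create the session with this config before building or running the model.
session = InteractiveSession(config=config)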
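
The snippet above goes through the TF 1.x compatibility layer. A TF 2.x-style way to request the same on-demand allocation might look like the sketch below. It is not from this thread and assumes the installed TensorFlow release exposes tf.config.experimental.list_physical_devices and tf.config.experimental.set_memory_growth; whether 2.0.0-alpha0 behaves identically was not confirmed here. The wrapper name enable_memory_growth is hypothetical.

```python
def enable_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand for every visible GPU.

    Returns the names of the GPUs that were configured. Returns an empty
    list if TensorFlow is not installed or no GPU is visible.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return []  # assumption for this sketch: treat a missing install as "nothing to do"

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before any GPU has been initialized, i.e. before the
        # first op or model is created; otherwise TensorFlow raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return [gpu.name for gpu in gpus]


if __name__ == "__main__":
    print("Memory growth enabled for:", enable_memory_growth())
```

Call it once at the very top of the script, before any tensors or models are created.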

bklare · April 21, 2021, 8:56pm

This solved my issue. Thank you!
