Problem starting Cuda Driver on Ubuntu 20.04

Hi guys.

I am trying to install a T4 in an Ubuntu 20.04 server that has no monitor, and I am running into problems.

When I type

find /usr/lib/modules -name nvidia.ko -exec modinfo {} \;

into the console, the output tells me that driver version 460.84 is installed.

When I write

nvidia-smi

I get the error

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

From what I understand, this means that the driver is installed but not running.
When I type

nvcc -V

I am told that I have CUDA compilation tools, release 10.1, V10.1.143.

From what I understand, I somehow have to start the driver. How do I do this?
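
One way to test the "installed but not running" reading is to check whether the kernel module is actually loaded. The following is a minimal sketch, not a definitive diagnosis: the function name module_loaded and its optional second argument are made up for illustration, and it assumes the usual /proc/modules layout in which each line begins with the module name followed by a space.

```shell
#!/bin/sh
# module_loaded NAME [FILE]
# Hypothetical helper: succeeds if NAME appears as a loaded module.
# FILE defaults to /proc/modules; the parameter exists only so the
# function can be tried against a sample listing.
module_loaded() {
    name=$1
    listing=${2:-/proc/modules}
    # Anchor on "name + space" so that "nvidia" does not match "nvidia_uvm".
    grep -q "^${name} " "$listing" 2>/dev/null
}

if module_loaded nvidia; then
    echo "nvidia module is loaded"
else
    echo "nvidia module is not loaded"
    # Possible next steps (assumption: the module is installed for the
    # running kernel): load it by hand and look at the kernel log.
    echo "try: sudo modprobe nvidia"
    echo "then: sudo dmesg | grep -i -e nvidia -e nvrm"
fi
```

If modprobe fails, the kernel log usually states why, which is more informative than the generic nvidia-smi message.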

Edit: I did not realize that I have to write “\” as a double “\”.


Hi, meanwhile, I followed the Linux Installation Guide to no avail.

I began with a cleanup as described at the end of the installation guide.

sudo apt-get --purge remove "*cublas*" "*cufft*" "*curand*" "*cusolver*" "*cusparse*" "*npp*" "*nvjpeg*" "cuda*" "nsight*"
sudo apt-get --purge remove "*nvidia*"
sudo apt-get autoremove
sudo reboot

Following the reboot, I followed the install instructions from chapter 2 onward:

find out which GPU I have

lspci | grep -i nvidia

find out which Linux version I am running

uname -m && cat /etc/*release

find out my gcc version

gcc --version

install the kernel headers and development packages for the running kernel

sudo apt-get install linux-headers-$(uname -r)

get the .pin-file

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin

move that file

sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600

get the installer

wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda-repo-ubuntu2004-11-3-local_11.3.1-465.19.01-1_amd64.deb

run the installer

sudo dpkg -i cuda-repo-ubuntu2004-11-3-local_11.3.1-465.19.01-1_amd64.deb

install the gpg key

sudo apt-key add /var/cuda-repo-ubuntu2004-11-3-local/7fa2af80.pub

update

sudo apt-get update

install

sudo apt-get -y install cuda

During the install, there was a warning: “the home dir /nonexistent you specified can’t be accessed, no such file or directory”.

update PATH

export PATH=/usr/local/cuda-11.3/bin${PATH:+:${PATH}}

reboot

sudo reboot

verify installation

cat /proc/driver/nvidia/version

This returns "NVRM version: NVIDIA UNIX x86_64 Kernel Module 465.19.01 Fri Mar 19 07:44:41 UTC 2021
GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
"

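Since the first post found a 460.84 module on disk and this output reports 465.19.01, it may help to compare the loaded module with the one modinfo finds. A rough sketch follows: the helper name nvrm_version is invented here, and the field layout is assumed from the single output line quoted above.

```shell
#!/bin/sh
# nvrm_version: read text on stdin and print the driver version found in
# an "NVRM version:" line. Hypothetical helper; the layout
# "... Kernel Module <version> <date>" is assumed from the output above.
nvrm_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

if [ -r /proc/driver/nvidia/version ]; then
    loaded=$(nvrm_version < /proc/driver/nvidia/version)
    # modinfo -F prints a single field of the module found for the
    # running kernel.
    on_disk=$(modinfo -F version nvidia 2>/dev/null)
    echo "loaded module: ${loaded:-unknown}, module on disk: ${on_disk:-unknown}"
else
    echo "/proc/driver/nvidia/version is missing (module probably not loaded)"
fi
```

If the two versions differ, or the file is missing after a later reboot, that would suggest the module changed between steps.
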
install nvcc

sudo apt install nvidia-cuda-toolkit

reboot

sudo reboot
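
The steps above put /usr/local/cuda-11.3/bin on PATH and then install nvidia-cuda-toolkit as well, so more than one nvcc may be present. The sketch below lists every match in lookup order; find_on_path is a made-up helper name, and the assumption that the two packages install nvcc in different directories is mine, not something confirmed in this thread.

```shell
#!/bin/sh
# find_on_path NAME
# Hypothetical helper: print every executable called NAME found in the
# directories of $PATH, in the order the shell would search them.
find_on_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        candidate="${dir:-.}/$name"   # an empty PATH entry means "."
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
    IFS=$old_ifs
}

# The first line printed is the nvcc that a plain "nvcc -V" would run.
find_on_path nvcc
```

Note that the export PATH line only affects the shell it was typed in, so after a reboot it has to be run again or placed in a shell startup file. In bash, type -a nvcc gives a similar listing.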

Following the installation guide, I attempt to compile the samples:

cd /NVIDIA_CUDA-11.3_Samples

Here, I get an error message saying that this directory does not exist.

nvidia-smi

fails with “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”

I would be delighted if you could help me.

Hi there.

Attached is my bug report, generated using

sudo nvidia-bug-report.sh

nvidia-bug-report.log.gz (14.5 MB)