I’m trying to get an Nvidia 970M working on a Linux Mint 18 laptop.
I’m trying this process:
sudo apt purge nvidia-*
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get install nvidia-390
The drivers seem to be installed; however, I normally check things like this by using nvidia-smi. Currently I get nvidia-smi: command not found. I'm not sure where I should expect nvidia-smi to be installed in the process.
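For what it's worth, the only way I know to look for the binary itself (in case it is installed but simply not on my PATH):
which nvidia-smi                 # check whether the binary is on the PATH at all
echo "$PATH" | tr ':' '\n'       # list the directories currently searched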
I’ve then installed the CUDA 9.1 toolkit from the NVIDIA download link. It complains that I have an unsupported configuration (CUDA 9.1 seems to support driver version 387), but I think I can ignore that.
I thought either the driver or the CUDA toolkit would install nvidia-smi. Am I wrong in that assumption?
Some additional info:
lspci -k | grep -EA3 'VGA|3D|Display'
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
DeviceName: Onboard IGD
Subsystem: Micro-Star International Co., Ltd. [MSI] 4th Gen Core Processor Integrated Graphics Controller
Kernel driver in use: i915
--
01:00.0 3D controller: NVIDIA Corporation GM204M [GeForce GTX 970M] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] GM204M [GeForce GTX 970M]
Kernel modules: nvidiafb, nouveau, nvidia_390, nvidia_390_drm
04:00.0 Ethernet controller: Qualcomm Atheros Killer E220x Gigabit Ethernet Controller (rev 13)
The nvidia-smi utility normally gets installed in the driver install step. It cannot/does not get installed in any other installation step.
nvidia-smi is not mandatory for basic driver operation (obviously - it is an informational utility).
If you install an NVIDIA GPU driver using a repository that is maintained by NVIDIA, you will always get the nvidia-smi utility with any recent driver install.
Unfortunately, the ppa repository is not maintained by NVIDIA. The maintainers of that repository may have deleted that executable from their install package. (and there may be other differences as well)
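One quick way to check whether the packages you actually have installed ship the utility at all (the package name nvidia-390 is a guess based on what you installed from the ppa):
dpkg -l 'nvidia-*' | grep '^ii'          # which nvidia packages are installed
dpkg -L nvidia-390 | grep nvidia-smi     # does that package contain the binary?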
For CUDA usage, I recommend that people not install their drivers from ppa.
Instead, follow published instructions by NVIDIA.
Get your installers from Official Drivers | NVIDIA (drivers only) or http://www.nvidia.com/getcuda (full CUDA toolkit installers, including drivers).
For CUDA, follow published install instructions:
CUDA Toolkit Documentation
You won’t find any NVIDIA references suggesting install from ppa, AFAIK.
I’ll try this again. I’ve failed to get the NVIDIA stuff to install in the past, but I’ll give it another shot.
In section 3.6, on this line, the documentation asks for <distro>, <version> and <architecture>:
sudo dpkg -i cuda-repo-<distro>_<version>_<architecture>.deb
What is the correct format for those values? An example would be really helpful in that part of the documentation. <architecture> seems obvious at x86_64. Section 2.2 seems to talk about version, but the output is rather verbose and it’s not clear what I should extract from it. Or is <version> 390? 387? Or perhaps 8.0, 9.0, 9.1, etc.? And how do I know which version to use? And I can’t see what <distro> means at all. Kernel version? Some NVIDIA number? I searched the documentation for “distro” and got nada.
This is just the filename of the deb file you download from http://www.nvidia.com/getcuda
One possible “exact” example is given here:
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=debnetwork
sudo dpkg -i cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
where <distro> = ubuntu1604, <version> = 9.1.85-1, and <architecture> = amd64
Ah, I see, I needed to download the network deb file. Sorry, I’m not too accustomed to dpkg use.
The next issue that crops up is this line in the docs:
When installing using network repo:
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<architecture>/7fa2af80.pub
I plugged in the <distro> and <architecture> you pointed to (ubuntu1604, amd64).
It produces the following error. I tried http as well.
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/amd64/7fa2af80.pub
Executing: /tmp/tmp.bwyyOV6jed/gpg.1.sh --fetch-keys
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/amd64/7fa2af80.pub
gpgkeys: protocol `https' not supported
gpg: no handler for keyserver scheme `https'
gpg: WARNING: unable to fetch URI https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/amd64/7fa2af80.pub: keyserver error
Follow the specific install instructions given to you on the network deb (or local deb) install page; it’s the same as the page I already pointed out to you. Please read everything on that whole page.
You have this:
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/amd64/7fa2af80.pub
the installer download page suggests this:
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
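If apt-key still cannot fetch the key directly, a workaround that should behave the same (assuming wget is installed) is to download the key and pipe it to apt-key:
wget -qO- http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub | sudo apt-key add -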
Ah, I see, I was still trying to follow the original documentation.
So I’ve now successfully run these 4 commands from the download page:
sudo dpkg -i cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda
The documentation goes on to talk about installing CUDA in sections 3.7.1 and 3.7.2, which appear to have been taken care of by the download page instructions. Section 3.7.3 talks vaguely about meta packages, but I’m unclear whether I’m supposed to install those. And section 4 onward appears unrelated to my installation.
Is the installation complete at this point? Should I expect nvidia-smi to be installed? If so, where would it be? I still get command not found when I try to run it. I have been through a reboot.
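The only other checks I know to run at this point (my understanding is the first file only exists once the kernel driver is actually loaded; the find is just a broad search under /usr):
cat /proc/driver/nvidia/version         # driver version, if the kernel module is loaded
find /usr -name nvidia-smi 2>/dev/null  # look for the binary anywhere under /usr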
I do find nvidia-smi in /usr/lib/nvidia-390/bin, but it gives me the following error if I run it:
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
I’m not clear if that was installed by this installation. I thought I ran an apt purge nvidia* before this, but that could have been left around.
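To figure out whether that library exists anywhere on the system, the checks I know of (ldconfig only reports libraries in the linker cache, so the find is a fallback):
ldconfig -p | grep libnvidia-ml                 # is the library known to the dynamic linker?
find /usr -name 'libnvidia-ml.so*' 2>/dev/null  # search for it anywhere under /usr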
I apologize for all the questions, and appreciate the help. I feel like a klutz, but this is a really hard process to follow. And ironically I’ve done it quite a few times before successfully (though never without significant scarring). :)
After those steps, you are supposed to follow the post-install instructions in the install guide.
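For a default CUDA 9.1 location, those post-install actions boil down to roughly this (paths assume the standard /usr/local/cuda-9.1 install; note this affects nvcc and the CUDA libraries, not nvidia-smi, which comes from the driver package):
export PATH=/usr/local/cuda-9.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}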
However, something is wrong with the driver install if you get “command not found” when trying to run nvidia-smi.
If your system is not cleaned up properly (see the cleanup guidance in Installation Guide Linux :: CUDA Toolkit Documentation) before you begin installing, it may lead to trouble. It’s really hard to diagnose what is wrong from a limited set of information, especially if you have tried many things before the proper install process. The standard install process has the highest probability of success from a clean OS load. If you have previously made various unsuccessful attempts to install the driver, and don’t do a proper clean-up or an OS re-install, it can prevent the standard install process from working correctly.
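A rough sketch of what that clean-up looks like for deb-based installs (the anchored regex patterns are an assumption; check the dpkg listing first and adjust to what is actually on your system):
dpkg -l | grep -Ei 'nvidia|cuda'                      # see what is currently installed
sudo apt-get remove --purge '^nvidia-.*' '^cuda-.*'   # remove matching packages
sudo apt-get autoremove                               # clean up orphaned dependencies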
Is there perhaps anything else that needs to be done where there are two GPUs on the system?
I’ve spent the past few hours trying to very carefully clean the system by making sure every dpkg install for nvidia or cuda is removed from /var/lib/dpkg/ and deleting every reference to cuda I can find, then reinstalling using both this method and the ppa (which basically everyone except NVIDIA recommends).
Neither seems to work. I’ve then retried with NVIDIA Prime so that I can at least see which GPU is in use. Incidentally, I have gotten both Prime and Bumblebee working on this system in the past.
When I try to switch to the NVIDIA GPU and reboot, it just comes back up on the Intel card.
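For reference, the switching I’m describing is via the stock nvidia-prime tooling (an assumption on my part that this is the relevant mechanism here):
prime-select query           # shows which GPU is currently selected (intel or nvidia)
sudo prime-select nvidia     # select the NVIDIA GPU, then reboot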
Reinstalling the entire OS is a hellish pursuit. And I’m pretty sure I’ll run into this problem again in the future, so I’d really like to learn how to solve it properly.
There are no meaningful errors in dmesg to indicate that the nvidia driver is failing. Is there a way to force Linux to try to load the nvidia driver? Where are the configuration files that control the graphics drivers?
Blank: no lines containing NVRM in it.
I also notice that no matter how carefully I try to purge the old installation, I’m clearly not removing everything because the drivers still exist in the driver manager. I’m not sure that’s necessarily bad, but it tells me I’m not starting totally clean each time.
Then the driver is not loading. A properly installed driver will put a single message containing NVRM in the system log, indicating that it is loading (at OS boot time). If the driver is properly installed, and it can detect the GPU but cannot load for some reason, it will put multiple NVRM messages in the log. There is no circumstance in which the driver will put zero messages in the log, except the case where the driver is not actually (properly) installed.
Either your driver is not installed correctly (you would have to inspect the apt-get install output carefully), or the NVIDIA GPU is not detectable by the driver. Your lspci output already indicates that the GPU is detectable, so I am thinking it is a broken install.
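If it helps, a minimal way to check the log and to force a load attempt (the module name nvidia_390 is taken from your lspci output; depending on the package it may also just be nvidia):
dmesg | grep -i nvrm         # driver load messages, if any
sudo modprobe nvidia_390     # ask the kernel to load the module; failures show up in dmesg
lsmod | grep nvidia          # confirm whether a module is now loaded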
Eventually, my solution for a Tesla K80 on Ubuntu 22 was:
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'
sudo apt install nvidia-driver-470 nvidia-dkms-470
sudo reboot
nvidia-smi
sudo apt-get install nvidia-docker2
sudo systemctl daemon-reload
sudo systemctl restart docker
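To confirm the container runtime can actually see the GPU, I'd run something like this (the image tag is just an example; use any CUDA base image tag that exists on Docker Hub):
sudo docker run --rm --gpus all nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi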
Drivers after R470 (specifically R495 and later, at least) do not support Kepler GPUs.