NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

Description

Hi Team,

I want to train and run inference with DL/transformer models on my laptop,
so I am trying to install the NVIDIA driver, CUDA Toolkit and cuDNN on Ubuntu 22.04 (dual boot).
I am running into problems while installing the NVIDIA driver.
Below are the steps I have taken so far.

  1. Initially I tried to install the NVIDIA driver with the command "ubuntu-drivers autoinstall". This left me with a black screen when booting Ubuntu, so I uninstalled it in recovery mode using sudo apt-get remove --purge '^nvidia-.*'
  2. While installing the CUDA Toolkit from the CUDA Toolkit 12.5 Downloads page (NVIDIA Developer), I installed the NVIDIA driver as below
    sudo apt-get install -y cuda-drivers
    This did not seem to work, so I uninstalled the driver again.
  3. Then I downloaded the run file from the official source (Linux x64 (AMD64/EM64T) Display Driver | 550.90.07 | Linux 64-bit | NVIDIA) and installed it. nvidia-smi still did not work.
  4. I disabled Secure Boot after the driver installation, but afterwards the laptop would not boot. I am not sure what caused this.
  5. Later I enrolled the public key generated in step 3 using mokutil --import "PUBLIC KEY PATH", rebooted, and added the key in the blue MOK management screen. nvidia-smi still did not work after this step.
  6. Then I created a public and private key with openssl before installing the NVIDIA driver and passed those keys to the installer, but ended up with the errors below in the log
    - SSL error:FFFFFFFF80000002:system library::No such file or directory: …/crypto/bio/bss_file.c:67
    - SSL error:10000080:BIO routines::no such file: …/crypto/bio/bss_file.c:75
    sign-file: Nvidia.key
    Failed to sign kernel module.
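
One plausible reading of the "No such file or directory" lines in step 6 is that sign-file could not open Nvidia.key, which can happen when the key is passed as a relative path. The sketch below generates a key pair and prints the installer command with absolute paths. The directory name, the certificate subject and the run file name NVIDIA-Linux-x86_64-550.90.07.run are assumptions for illustration; --module-signing-secret-key and --module-signing-public-key are the nvidia-installer options for supplying your own key pair.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical location for the signing key pair; override with KEY_DIR=...
KEY_DIR="${KEY_DIR:-$HOME/nvidia-mok}"
mkdir -p "$KEY_DIR"

# Create an unencrypted RSA private key and a self-signed X.509
# certificate in DER format (the format mokutil --import expects).
openssl req -new -x509 -newkey rsa:2048 -nodes -days 3650 \
  -subj "/CN=Local NVIDIA module signing key/" \
  -keyout "$KEY_DIR/Nvidia.key" \
  -outform DER -out "$KEY_DIR/Nvidia.der"

# Resolve absolute paths so the installer finds the files regardless
# of the directory it runs from.
KEY_PATH="$(readlink -f "$KEY_DIR/Nvidia.key")"
CERT_PATH="$(readlink -f "$KEY_DIR/Nvidia.der")"

# Print (do not run) the follow-up commands; the run file name is assumed.
echo "sudo sh ./NVIDIA-Linux-x86_64-550.90.07.run --module-signing-secret-key=$KEY_PATH --module-signing-public-key=$CERT_PATH"
echo "sudo mokutil --import $CERT_PATH"
```

After mokutil --import, the key still has to be confirmed in the MOK management screen on the next reboot, as in step 5.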

I can see the public key registered in the output of mokutil --list-enrolled
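
An enrolled key only helps if the installed module was actually signed with it. "Key was rejected by service" from modprobe usually means Secure Boot is enforcing signatures and the module's signature does not match any enrolled key. A minimal sketch for comparing the two sides follows; the check helper is a hypothetical name introduced here, and it skips any tool that is not installed.

```shell
#!/usr/bin/env bash

# check LABEL CMD [ARGS...]: run CMD if it exists, otherwise say it was skipped.
check() {
  local label="$1"; shift
  echo "== $label"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(command exited with status $?)"
  else
    echo "(skipped: $1 not found)"
  fi
}

# Is Secure Boot currently enforcing module signatures?
check "Secure Boot state" mokutil --sb-state

# Who signed the nvidia module that modprobe would load?
# Empty output would suggest the module is unsigned.
check "Signer of the installed nvidia module" modinfo -F signer nvidia

# Which keys are enrolled? Compare the subjects with the signer above.
check "Enrolled MOK keys" mokutil --list-enrolled
```

If the signer is empty or does not match any enrolled subject, the module on disk is probably not the one that was signed, which would also fit the "Diff between built and installed module" warning from dkms status.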

Environment

GPU Type: GeForce RTX 4090
Nvidia Driver Version: 550.90.07
CUDA Version: 12.5
CUDNN Version: not installed
Operating System + Version: Ubuntu 22.04

Relevant Files

I am attaching the nvidia bug report
nvidia-bug-report.log (889.8 KB)

Log file from the installation in step 6
nvidia-installer_withcustomkey.log (38.0 KB)

Output of some helpful commands I found in discussions of similar issues

uname -r
6.5.0-41-generic

modprobe nvidia
modprobe: ERROR: could not insert ‘nvidia’: Key was rejected by service

dkms status
nvidia/550.90.07, 6.5.0-41-generic, x86_64: installed (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!) (WARNING! Diff between built and installed module!)

lsmod | grep -E "nouveau|nvidia"
nvidia_wmi_ec_backlight 12288 0
video 73728 4 nvidia_wmi_ec_backlight,dell_wmi,dell_laptop,i915
wmi 40960 8 dell_wmi_sysman,video,nvidia_wmi_ec_backlight,dell_wmi_ddv,dell_wmi,wmi_bmof,dell_smbios,dell_wmi_descriptor

dpkg -l | grep nvidia

I tried to uninstall the other NVIDIA driver versions shown in the screenshot above, but they are not being removed. The commands I tried are below.

sudo apt-get remove --purge '^nvidia-.*'
sudo apt autoremove
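
Note that the pattern '^nvidia-.*' only matches package names that start with "nvidia-", so packages such as libnvidia-* or cuda-drivers are not touched by it. A small sketch for listing what is still present, with its dpkg state ("ii" = installed, "rc" = removed but config files remain); list_nvidia_pkgs is a hypothetical helper name, and a driver installed from the run file will not appear here at all because dpkg does not track it.

```shell
#!/usr/bin/env bash

# Reads `dpkg -l` output on stdin and prints "<state> <package>" for
# every package whose name contains "nvidia" or "cuda-drivers".
list_nvidia_pkgs() {
  awk 'length($1) <= 3 && $2 ~ /nvidia|cuda-drivers/ { print $1, $2 }'
}

# Only query dpkg where it exists (Debian/Ubuntu systems).
if command -v dpkg >/dev/null 2>&1; then
  dpkg -l | list_nvidia_pkgs
fi
```

Any package names that show up can then be passed explicitly to sudo apt-get remove --purge.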

I disabled nouveau as below

echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf; sudo update-initramfs -u

Any help is appreciated @generix @MarkusHoHo

Best Regards

I have been facing the same issue on my Tesla T4 AWS server.
I was not able to run the nvidia-smi command at all.
This is actually a driver issue.
I deleted my instance, created a new one and did the setup again.

But it didn't work.
I still get the same error.

You got a black screen when installing the Ubuntu-provided, signed driver because you created an xorg.conf. Please delete it, uninstall the runfile driver and reinstall the signed driver from the Ubuntu repo.
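
The three steps could look roughly like the sketch below. It is a dry run by default and only prints the commands; setting APPLY=1 executes them. The run helper, the backup file name and the package name nvidia-driver-550 are assumptions here (check the output of ubuntu-drivers devices for the package recommended on your system), and the sketch moves xorg.conf aside instead of deleting it so it can be restored. nvidia-uninstall is the uninstaller that the run file installs.

```shell
#!/usr/bin/env bash

# Print each command by default; execute it only when APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

# 1. Move the hand-made xorg.conf out of the way, if there is one.
if [ -f /etc/X11/xorg.conf ]; then
  run sudo mv /etc/X11/xorg.conf /etc/X11/xorg.conf.bak
fi

# 2. Remove the driver that was installed from the run file.
run sudo nvidia-uninstall

# 3. Install the signed driver from the Ubuntu repository.
run sudo apt-get update
run ubuntu-drivers devices
run sudo apt-get install -y nvidia-driver-550   # assumed package name

# 4. Reboot so the new kernel module is loaded.
run sudo reboot
```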

@generix Can you help me with the uninstall command? I tried the command mentioned above, but the driver is not getting uninstalled completely.

Could you please give more detail on the driver from the Ubuntu repo?

All options are disabled for me in Additional Drivers, if I am understanding this correctly.