Ubuntu 22.04 with A100: Cannot get drivers working

Hey,

I am currently trying to set up a workstation running Ubuntu 22.04 LTS with an A100 GPU for machine learning tasks.
The workstation is not intended to run headless, i.e., we want a screen attached that uses the Intel onboard GPU for the display.

After trying various combinations of drivers and guides, I have not managed to get a working system. That is, I can't get past

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

even though the driver installation itself completes without errors. I installed version 535 from the deb package.

$ sudo dmesg
[  224.448228] NVRM: None of the NVIDIA devices were initialized.
[  224.448522] nvidia-nvlink: Unregistered Nvlink Core, major device number 234
[  224.799882] nvidia-nvlink: Nvlink Core is being initialized, major device number 234

[  224.800833] nvidia 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  224.803676] NVRM: The NVIDIA GPU 0000:02:00.0
               NVRM: (PCI ID: 10de:20f1) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[  224.803718] nvidia: probe of 0000:02:00.0 failed with error -1
[  224.803734] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  224.803735] NVRM: None of the NVIDIA devices were initialized.
[  224.803830] nvidia-nvlink: Unregistered Nvlink Core, major device number 234
[  225.135010] nvidia-nvlink: Nvlink Core is being initialized, major device number 234

Other information about the system:

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.3 LTS
Release:	22.04
Codename:	jammy
$ uname -r
6.2.0-34-generic

Any help is very much appreciated.

Thanks,
Michael

Hi @michael.mayer91113 and welcome to the NVIDIA developer forums.

First of all, I recommend running nvidia-bug-report.sh and checking the resulting output log for more information. Please also attach it to this post.

“Fallen off the bus” most often points to a problem on the PCIe bus, such as power supply, temperature, or other PCIe issues. You can look for PCI-related kernel messages that mention the GPU's address, 0000:02:00.0, and see whether there are any system warnings or errors.
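
As a rough sketch of that check, something like the following could narrow the kernel log down to the relevant lines. The helper name pci_msgs and the keyword list are only illustrative assumptions, not part of any NVIDIA tool; adjust the address if your GPU sits elsewhere.

```shell
# Hypothetical helper: read kernel log lines on stdin and keep only those
# that mention the given PCI address or a few common PCIe error keywords.
# Note: the address is used as a regex, so its dots match any character;
# that is loose, but good enough for a quick filter.
pci_msgs() {
    addr="$1"
    grep -E -e "$addr" -e 'pcieport|AER|fallen off the bus'
}

# Usage on the affected machine (address taken from the dmesg output above):
#   sudo dmesg | pci_msgs 0000:02:00.0
# To inspect the device's PCIe capabilities and link status:
#   sudo lspci -vv -s 02:00.0
```

Lines from pcieport or AER that appear shortly before the NVRM messages would be the first thing to look at.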

Sometimes it is as simple as re-seating the GPU in its slot or moving it to a different slot.

Thanks!