Struggling with installing drivers for A100 on Super Micro motherboard

Hi,
I have been struggling to install all of the software necessary to run cuDNN on my A100.
I am running Ubuntu 22.04 on a Supermicro motherboard (Supermicro MBD-X13SEI-F-O ATX Server Motherboard) with 128 GB of memory, and I am looking to install CUDA 11.8, as recommended in the NVIDIA documentation.
I believe that I’ve successfully installed MLNX_OFED and GDS (2 NVMe drives), but I seem to be blocked from moving forward since I can’t install the drivers for my A100.
I have been reading through several of these posts, but many of these are 5+ years old on previous versions of Linux and different PC manufacturers. So, it’s not clear to me what applies or what has been addressed.

When I run nvidia-smi, I simply receive the generic error message:
“NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”
At the end of this message are the repositories I am referencing.
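
For anyone landing here with the same nvidia-smi message: it usually helps to first separate "the card is not visible on the PCI bus" from "the card is visible but the nvidia kernel module is not loaded". The following is only a minimal sketch of those two checks. The function names are made up for illustration, and both filters simply read text on stdin, so they make no assumptions about this particular system.

```shell
# Succeeds if an lsmod-style listing on stdin contains the core "nvidia" module.
# (Related modules such as nvidia_uvm do not count; the first column must be exactly "nvidia".)
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Succeeds if an lspci-style listing on stdin mentions an NVIDIA device.
nvidia_device_present() {
    grep -qi 'nvidia'
}
```

Typical use would be `lspci | nvidia_device_present && echo "GPU visible on PCI bus"` followed by `lsmod | nvidia_module_loaded || echo "nvidia module not loaded"`.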

I suspect that DKMS may be involved somehow, since when I run ‘dkms status’, I get the output:
(base) root@brian:~# dkms status
iser/23.07, 6.2.0-35-generic, x86_64: installed
iser/23.07, 6.2.0-36-generic, x86_64: installed
isert/23.07, 6.2.0-35-generic, x86_64: installed
isert/23.07, 6.2.0-36-generic, x86_64: installed
kernel-mft-dkms/4.25.0, 6.2.0-35-generic, x86_64: installed
kernel-mft-dkms/4.25.0, 6.2.0-36-generic, x86_64: installed
knem/1.1.4.90mlnx2, 6.2.0-35-generic, x86_64: installed
knem/1.1.4.90mlnx2, 6.2.0-36-generic, x86_64: installed
mlnx-nfsrdma/23.07, 6.2.0-35-generic, x86_64: installed
mlnx-nfsrdma/23.07, 6.2.0-36-generic, x86_64: installed
mlnx-nvme/23.07, 6.2.0-35-generic, x86_64: installed
mlnx-nvme/23.07, 6.2.0-36-generic, x86_64: installed
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia-fs/2.18.3/source/dkms.conf does not exist.

mlnx-ofed-kernel/23.07, 6.2.0-35-generic, x86_64: installed
mlnx-ofed-kernel/23.07, 6.2.0-36-generic, x86_64: installed
nvidia/545.23.06: added

Could there be conflicts between 6.2.0-35-generic and 6.2.0-36-generic?
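
One thing that stands out in the listing above is that every Mellanox module is "installed" for both kernels, while nvidia/545.23.06 is only "added", which in DKMS terms means the source is registered but has not been built or installed for any kernel. A rough sketch for pulling that information out of `dkms status` automatically is below. The function names are hypothetical, and the parsing assumes the output format shown in this post.

```shell
# Reads `dkms status` output on stdin and prints every module entry whose
# state is not "installed" (for example, modules that were only "added").
# Stray "Error!" / "File:" lines are ignored because they do not look like
# "module/version" entries.
dkms_not_installed() {
    awk -F': *' '/^[^ ]+\/[^ ,:]+[,:]/ && $NF !~ /^installed/ { print $1 ": " $NF }'
}

# Succeeds if module $1 is installed for kernel release $2.
dkms_built_for() {
    awk -F'[,:] *' -v m="$1" -v k="$2" '
        { split($1, mv, "/") }
        mv[1] == m && $2 == k && $NF ~ /^installed/ { ok = 1 }
        END { exit !ok }'
}
```

For example, `dkms status 2>/dev/null | dkms_not_installed` lists the entries that still need a build, and `dkms status 2>/dev/null | dkms_built_for nvidia "$(uname -r)"` checks the running kernel specifically.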

Can anyone suggest ways to resolve the above ‘Error!’, to check if my A100 is functioning correctly, ways to ensure my system has the correct files to compile the new drivers, or anything else that can get me past my configuration issues?
I’m excited to begin using this A100 but am amazed at just how difficult it has been to get the software configured.
Any help is appreciated and will be met with my undying gratitude,
Brian

REPOSITORIES
(base) root@brian:~# apt-cache policy
Package files:
100 /var/lib/dpkg/status
release a=now
500 Index of /repos/edge/ stable/main amd64 Packages
release o=edge stable,a=stable,n=stable,l=edge stable,c=main,b=amd64
origin packages.microsoft.com
500 Index of /danielrichter2007/grub-customizer/ubuntu jammy/main amd64 Packages
release v=22.04,o=LP-PPA-danielrichter2007-grub-customizer,a=jammy,n=jammy,l=Launchpad PPA for Grub Customizer,c=main,b=amd64
origin ppa.launchpadcontent.net
600 Index of /compute/cuda/repos/ubuntu2204/x86_64 Packages
release o=NVIDIA,l=NVIDIA CUDA,c=
origin developer.download.nvidia.com
500 Index of /christian-boxdoerfer/fsearch-stable/ubuntu jammy/main amd64 Packages
release v=22.04,o=LP-PPA-christian-boxdoerfer-fsearch-stable,a=jammy,n=jammy,l=FSearch,c=main,b=amd64
origin ppa.launchpadcontent.net
500 Index of /graphics-drivers/ppa/ubuntu jammy/main i386 Packages
release v=22.04,o=LP-PPA-graphics-drivers,a=jammy,n=jammy,l=Proprietary GPU Drivers,c=main,b=i386
origin ppa.launchpadcontent.net
500 Index of /graphics-drivers/ppa/ubuntu jammy/main amd64 Packages
release v=22.04,o=LP-PPA-graphics-drivers,a=jammy,n=jammy,l=Proprietary GPU Drivers,c=main,b=amd64
origin ppa.launchpadcontent.net
500 Index of /ubuntu jammy-security/multiverse i386 Packages
release v=22.04,o=Ubuntu,a=jammy-security,n=jammy,l=Ubuntu,c=multiverse,b=i386
origin security.ubuntu.com
500 Index of /ubuntu jammy-security/multiverse amd64 Packages
release v=22.04,o=Ubuntu,a=jammy-security,n=jammy,l=Ubuntu,c=multiverse,b=amd64
origin security.ubuntu.com
500 Index of /ubuntu jammy-security/universe i386 Packages
release v=22.04,o=Ubuntu,a=jammy-security,n=jammy,l=Ubuntu,c=universe,b=i386
origin security.ubuntu.com
500 Index of /ubuntu jammy-security/universe amd64 Packages
release v=22.04,o=Ubuntu,a=jammy-security,n=jammy,l=Ubuntu,c=universe,b=amd64
origin security.ubuntu.com
500 Index of /ubuntu jammy-security/restricted i386 Packages
release v=22.04,o=Ubuntu,a=jammy-security,n=jammy,l=Ubuntu,c=restricted,b=i386
origin security.ubuntu.com
500 Index of /ubuntu jammy-security/restricted amd64 Packages
release v=22.04,o=Ubuntu,a=jammy-security,n=jammy,l=Ubuntu,c=restricted,b=amd64
origin security.ubuntu.com
500 Index of /ubuntu jammy-security/main i386 Packages
release v=22.04,o=Ubuntu,a=jammy-security,n=jammy,l=Ubuntu,c=main,b=i386
origin security.ubuntu.com
500 Index of /ubuntu jammy-security/main amd64 Packages
release v=22.04,o=Ubuntu,a=jammy-security,n=jammy,l=Ubuntu,c=main,b=amd64
origin security.ubuntu.com
100 Index of /ubuntu jammy-backports/universe i386 Packages
release v=22.04,o=Ubuntu,a=jammy-backports,n=jammy,l=Ubuntu,c=universe,b=i386
origin us.archive.ubuntu.com
100 Index of /ubuntu jammy-backports/universe amd64 Packages
release v=22.04,o=Ubuntu,a=jammy-backports,n=jammy,l=Ubuntu,c=universe,b=amd64
origin us.archive.ubuntu.com
100 Index of /ubuntu jammy-backports/main i386 Packages
release v=22.04,o=Ubuntu,a=jammy-backports,n=jammy,l=Ubuntu,c=main,b=i386
origin us.archive.ubuntu.com
100 Index of /ubuntu jammy-backports/main amd64 Packages
release v=22.04,o=Ubuntu,a=jammy-backports,n=jammy,l=Ubuntu,c=main,b=amd64
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy-updates/multiverse i386 Packages
release v=22.04,o=Ubuntu,a=jammy-updates,n=jammy,l=Ubuntu,c=multiverse,b=i386
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy-updates/multiverse amd64 Packages
release v=22.04,o=Ubuntu,a=jammy-updates,n=jammy,l=Ubuntu,c=multiverse,b=amd64
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy-updates/universe i386 Packages
release v=22.04,o=Ubuntu,a=jammy-updates,n=jammy,l=Ubuntu,c=universe,b=i386
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy-updates/universe amd64 Packages
release v=22.04,o=Ubuntu,a=jammy-updates,n=jammy,l=Ubuntu,c=universe,b=amd64
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy-updates/restricted i386 Packages
release v=22.04,o=Ubuntu,a=jammy-updates,n=jammy,l=Ubuntu,c=restricted,b=i386
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy-updates/restricted amd64 Packages
release v=22.04,o=Ubuntu,a=jammy-updates,n=jammy,l=Ubuntu,c=restricted,b=amd64
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy-updates/main i386 Packages
release v=22.04,o=Ubuntu,a=jammy-updates,n=jammy,l=Ubuntu,c=main,b=i386
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy-updates/main amd64 Packages
release v=22.04,o=Ubuntu,a=jammy-updates,n=jammy,l=Ubuntu,c=main,b=amd64
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy/multiverse i386 Packages
release v=22.04,o=Ubuntu,a=jammy,n=jammy,l=Ubuntu,c=multiverse,b=i386
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy/multiverse amd64 Packages
release v=22.04,o=Ubuntu,a=jammy,n=jammy,l=Ubuntu,c=multiverse,b=amd64
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy/universe i386 Packages
release v=22.04,o=Ubuntu,a=jammy,n=jammy,l=Ubuntu,c=universe,b=i386
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy/universe amd64 Packages
release v=22.04,o=Ubuntu,a=jammy,n=jammy,l=Ubuntu,c=universe,b=amd64
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy/restricted i386 Packages
release v=22.04,o=Ubuntu,a=jammy,n=jammy,l=Ubuntu,c=restricted,b=i386
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy/restricted amd64 Packages
release v=22.04,o=Ubuntu,a=jammy,n=jammy,l=Ubuntu,c=restricted,b=amd64
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy/main i386 Packages
release v=22.04,o=Ubuntu,a=jammy,n=jammy,l=Ubuntu,c=main,b=i386
origin us.archive.ubuntu.com
500 Index of /ubuntu jammy/main amd64 Packages
release v=22.04,o=Ubuntu,a=jammy,n=jammy,l=Ubuntu,c=main,b=amd64
origin us.archive.ubuntu.com
Pinned packages:
nsight-compute -> 2021.3.1.4~11.5.1-1ubuntu1 with priority -1
nsight-systems -> 2021.3.3.2~11.5.1-1ubuntu1 with priority -1
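
The policy listing above includes two separate sources that can provide NVIDIA driver packages: NVIDIA's CUDA repository (priority 600) and the graphics-drivers PPA (priority 500). Whether that mix contributed to the problem is not established in this thread, but a short filter makes it easier to see which such sources are configured. This is a sketch only. The function name is hypothetical, and the pattern is a guess at what identifies these two sources in `apt-cache policy` output.

```shell
# Reads `apt-cache policy` output on stdin and prints the priority/source lines
# that mention NVIDIA's CUDA repository or the graphics-drivers PPA.
nvidia_apt_sources() {
    grep -Ei '^ *-?[0-9]+ .*(nvidia|cuda|graphics-drivers)'
}
```

Run it as `apt-cache policy | nvidia_apt_sources`.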

Hi @briansboyd ,
Can you please share the dmesg logs?

Thanks

Thanks Aakankshas! Here are all of these logs as attachments. Looking forward to hearing back! I will be very responsive this weekend, all week, and the following weekend. Hopefully we can get the A100 driver installed and working with cuDNN during that time. Brian

dmesg.1.gz (26.1 KB)

dmesg.2.gz (25.5 KB)

dmesg.3.gz (26.1 KB)

dmesg.4.gz (25.6 KB)

dmesg_for_AakankshasNvidia.txt (128 KB)

Hi @briansboyd ,
Sincere apologies for the delay, but this doesn’t look like a cuDNN-related issue, and the Drivers forum may be able to help you better.
Moving it.
Thanks

Hello @briansboyd,

Since we are talking about an A100, did you already contact Enterprise support? These GPUs usually come through ISVs or similar and go with support contracts, if I am not mistaken. That way you should have faster and more direct access to personalized help.

That said, to answer one of your questions, yes, having two kernel versions installed can cause issues.

In your description I only see mention of the Mellanox driver. That does not install the NVIDIA GPU drivers. On the other hand, nvidia-smi is installed, which points towards some (unsuccessful?) installation of the driver.

Can you run sudo nvidia-bug-report.sh and attach the resulting log file here?
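
In the meantime, the kernel log can be skimmed for driver messages directly. The snippet below is only a sketch with a made-up function name. It relies on the fact that the NVIDIA kernel module prefixes its own messages with "NVRM", and it also keeps any line mentioning "nvidia".

```shell
# Keeps only the kernel-log lines on stdin that were written by or mention
# the NVIDIA driver ("NVRM" is the prefix used by the NVIDIA kernel module).
nvidia_kernel_lines() {
    grep -E 'NVRM|nvidia'
}
```

It can be fed from a live log (`sudo dmesg | nvidia_kernel_lines`) or from a compressed one (`zcat nvidia-bug-report.log.gz | nvidia_kernel_lines`). An empty result suggests the module never attempted to load.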

The installation instructions for CUDA were reviewed recently and have become clearer in later revisions, and they can easily be applied to earlier CUDA versions. You might want to check those and see if you overlooked a step in your setup process.

Thanks!

Thanks for your response @MarkusHoHo and all others who worked with me on this. I actually bought this A100 off of Amazon and built a custom SuperMicro system around it. I worked with a company to build the hardware and install Linux, and they apparently installed some driver that corrupted that Linux installation.
Long story short, I erased that installation and installed a fresh one, and then followed the instructions here: How to Install CUDA on Ubuntu 22.04 | Step-by-Step | Cherry Servers

I’ve just installed, and so haven’t run any jobs, but am excited to have now gotten this far.

Brian
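
For readers who reach the same point, a quick sanity check before running jobs is to confirm that nvidia-smi now reports the GPU and that the driver is recent enough for the intended CUDA release. This is a hedged sketch: the function name is hypothetical, and the threshold of 520 used in the usage line is an assumption based on the 520-series minimum listed for CUDA 11.8 on Linux, so verify it against the CUDA release notes.

```shell
# Reads "name, driver_version" CSV lines (one per GPU) on stdin and succeeds
# only if at least one GPU is listed and every driver major version is >= $1.
driver_at_least() {
    awk -F', *' -v min="$1" '
        { n++; split($2, v, "."); if (v[1] + 0 < min + 0) bad = 1 }
        END { exit (n == 0 || bad) ? 1 : 0 }'
}
```

Usage: `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | driver_at_least 520 && echo "driver OK"`.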
