I am currently working with an Azure NVadsA10_v5 VM running Ubuntu 22.04 Linux, and I am encountering persistent issues while installing NVIDIA drivers, CUDA packages, and cuDNN to enable GPU capabilities.
Despite following all the recommended steps, including:
Disabling Secure Boot,
Ensuring kernel compatibility,
Reinstalling different NVIDIA driver versions (nvidia-driver-535, nvidia-driver-550, nvidia-driver-535-server, etc.),
I still face the same issue when I run the nvidia-smi command:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
What I've Tried
Installed the recommended driver versions for CUDA compatibility.
Verified kernel versions and rebuilt DKMS modules.
Disabled Secure Boot.
Followed blogs and documentation for reinstalling NVIDIA drivers and CUDA packages.
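For anyone comparing notes, the checks listed above can be gathered in one pass. This is only a rough sketch: it assumes mokutil and dkms are installed (they may not be on every image), and secure_boot_enabled is a hypothetical helper name, not part of any tool.

```shell
# Interpret `mokutil --sb-state` style output read from stdin.
# (secure_boot_enabled is a hypothetical helper, named here for illustration.)
secure_boot_enabled() { grep -q 'SecureBoot enabled'; }

echo "Running kernel: $(uname -r)"

# Secure Boot state; mokutil may not be installed on every image.
if mokutil --sb-state 2>/dev/null | secure_boot_enabled; then
  echo "Secure Boot: enabled (unsigned modules will be rejected)"
else
  echo "Secure Boot: not reported as enabled"
fi

# Which driver modules DKMS has built, and for which kernels.
dkms status 2>/dev/null || echo "dkms not available"

# Whether the nvidia kernel module is currently loaded.
if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
  echo "nvidia module: loaded"
else
  echo "nvidia module: NOT loaded"
fi
```

If DKMS lists the driver for a kernel other than the one printed on the first line, the module was most likely built for a different kernel than the one that booted.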
This is how I got it working for Standard_NV72ads_A10_v5:
Installing the NVIDIA Driver on an Azure VM
Prerequisites
This guide is specifically for Azure VMs using GRID drivers for Azure.
The VM must be created in Standard mode to disable Trusted Launch.
1. Connect to Your VM
Use SSH to connect to your Azure VM.
ssh your-username@your-vm-ip-address
2. Update the Package List
Before installing new packages, update the package list.
sudo apt-get update
3. Install Necessary Packages
Install the required packages for building the NVIDIA driver.
sudo apt-get install -y build-essential
4. Blacklist Nouveau Drivers
Ensure that the Nouveau drivers are blacklisted to prevent conflicts.
echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
5. Reboot the VM
Reboot your VM to apply changes.
sudo reboot
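After the reboot, it may be worth confirming that Nouveau really stayed out of the kernel before moving on. A minimal sketch, where nouveau_loaded is a hypothetical helper name and the optional file argument exists only so the check can be tried against a saved listing:

```shell
# Succeeds if the nouveau module appears in a /proc/modules-style listing.
# Defaults to the live /proc/modules; pass a file to check a saved copy.
nouveau_loaded() { grep -q '^nouveau ' "${1:-/proc/modules}" 2>/dev/null; }

if nouveau_loaded; then
  echo "nouveau is still loaded; recheck the blacklist file and the initramfs"
else
  echo "nouveau is not loaded"
fi
```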
6. Download Driver File and Make the Driver File Executable
Download the GRID driver file for Azure and change its permissions to make it executable.
Hello, Simon! Your comment is one of the most recent and relevant to my issue. I'm stuck on step 7. When I try to run the NVIDIA installer, I get the following error:
ERROR: An error occurred while performing the step: 'Building kernel modules'. See /var/log/nvidia-installer.log for details.
ERROR: An error occurred while performing the step: 'Checking to see whether the NVIDIA kernel module was successfully built'. See /var/log/nvidia-installer.log for details.
ERROR: The NVIDIA kernel module was not created.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
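The log named in the error usually contains the actual compiler failure. A rough sketch for pulling out the relevant lines and checking for matching kernel headers follows; show_installer_errors is a hypothetical helper name, and the header package name assumes Ubuntu's linux-headers-<release> naming.

```shell
# Print the last error lines from an nvidia-installer style log.
# (show_installer_errors is a hypothetical helper, named for illustration.)
show_installer_errors() {
  log="${1:-/var/log/nvidia-installer.log}"
  if [ ! -r "$log" ]; then
    echo "cannot read $log"
    return 1
  fi
  grep -iE 'error|fatal' "$log" | tail -n 20
}

show_installer_errors || true

# Kernel module builds need headers that match the running kernel.
# Assumption: Ubuntu-style package naming, linux-headers-<kernel release>.
if dpkg -s "linux-headers-$(uname -r)" >/dev/null 2>&1; then
  echo "headers for $(uname -r): installed"
else
  echo "headers for $(uname -r): not installed"
fi
```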
Are you running Linux with kernel 6.11 (Ubuntu 24.04, for example)?
If so, there are some well-known issues...
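A quick way to answer that question on the VM is sketched below. kernel_at_least is a hypothetical helper name; it compares only the major and minor numbers of a `uname -r` style string.

```shell
# Succeeds if a `uname -r` style release ($1) is at least major.minor ($2.$3).
# (kernel_at_least is a hypothetical helper, named for illustration.)
kernel_at_least() {
  ver="${1%%-*}"        # drop a suffix such as "-1008-azure"
  major="${ver%%.*}"    # text before the first dot
  rest="${ver#*.}"      # text after the first dot
  minor="${rest%%.*}"   # text before the next dot
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

if kernel_at_least "$(uname -r)" 6 11; then
  echo "kernel $(uname -r) is 6.11 or newer"
else
  echo "kernel $(uname -r) is older than 6.11"
fi
```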
Since my last post, I have switched to the Azure NVIDIA GPU Driver Extension to install the NVIDIA driver.
I provision the VM in Standard mode (Trusted Launch disabled).
Then I install the extension using the driver version specified in Point 2.
If your VM is running kernel 6.11, you'll have to consider Point 1.
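For reference, installing the extension from the Azure CLI might look roughly like the sketch below. The resource group and VM names are placeholders, install_gpu_extension and DRY_RUN are hypothetical names introduced here, and the sketch does not pass any driver version setting because the specific version is not given in this thread.

```shell
# Install the NVIDIA GPU Driver Extension on a Linux VM via the Azure CLI.
# $1 = resource group, $2 = VM name (both placeholders).
# With DRY_RUN=1 the command is printed instead of executed.
install_gpu_extension() {
  set -- az vm extension set \
    --resource-group "$1" \
    --vm-name "$2" \
    --name NvidiaGpuDriverLinux \
    --publisher Microsoft.HpcCompute
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Print the command that would be run, without touching Azure.
DRY_RUN=1 install_gpu_extension my-resource-group my-vm
```

Running it without DRY_RUN requires a logged-in Azure CLI session.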
Thanks @simon.renuart. After a couple of days dealing with this problem, I was finally able to solve it. The underlying issue was the kernel, just as you pointed out.