Nvidia driver issue on VM

Hi, I am having trouble using the GPU attached to my VM. CUDA does not seem to be able to locate the GPU: for example, nvidia-smi reports no devices. Similar to the issue reported here (nvidia-smi "No devices were found" error), when I run “dmesg | grep NVRM” I get a series of errors saying RmInitAdapterFailed. In that thread, people seem to have had to reinstall Ubuntu and install dependencies in a particular order. Please could I have some help resolving this?
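
For anyone reproducing the problem, the two checks described above can be sketched as follows. The helper name `nvrm_init_failed` is hypothetical, and matching on the string `RmInitAdapter` is an assumption based on the error named in this post; the exact message text may differ between driver versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads kernel log text on stdin and succeeds (exit 0)
# if any NVRM line mentions RmInitAdapter (assumed marker of the failure).
nvrm_init_failed() {
    grep -q 'NVRM.*RmInitAdapter'
}

# Usage on the affected VM (reading dmesg may need root on some systems):
#   nvidia-smi        # in this thread it reported that no devices were found
#   if sudo dmesg | nvrm_init_failed; then
#       echo "NVRM reported an adapter initialisation failure"
#   fi
```

The helper only flags the symptom; it does not say which driver is the right one for the VM.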

Which hypervisor/GPU?

Thanks for your response. We have installed this driver onto the system:

image (8)

The system is running Ubuntu 18.04 on an Azure Standard NC4as T4 v3 VM.

Hi generix, please see the extra details above. Let me know if you need anything else, and thanks for looking into this.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

re227-ml-neph-renal_download_2022-03-22_12-26-28.zip (46.9 KB)

Hi generix, please see attached, thanks again for your help (file is within the folder)

The RmInit error seems to be rather Azure-specific:

0x63:0x55:2344

Did you already try installing the driver using the Azure NVIDIA driver extension rather than manually?
https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux

Hi generix

I would try that, but the VM has very restricted internet access because it is inside a live production environment, and the link you provided indicates the VM needs full internet access. To get around that, I put the driver I described earlier onto the machine manually. Does the fact that nvidia-smi doesn’t work indicate that the NVIDIA GPU driver hasn’t been installed properly?

The driver is correctly installed, but the real question is whether this is the correct driver version. VMs are special, and the Microsoft docs are ambiguous regarding your specific instance.
Please check whether the GRID drivers have to be used instead:
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup

The previously linked Azure extension would auto-choose the correct driver, so without it you’re left to find the correct version by trial and error.

Hi generix,

I managed to get the GRID driver onto the machine manually from the link you provided and loaded it. It gave me a ‘NVIDIA NVML Driver/library version mismatch’ error when I tried ‘nvidia-smi’ again. So I uninstalled all existing NVIDIA drivers using ‘sudo /usr/bin/nvidia-uninstall’, rebooted the machine, and disabled the X server using ‘sudo init 3’ so that the GRID driver could be installed again directly from the command line, without a user interface. Now the service appears to be running … I really appreciate your help with this and hope your contributions are rewarded somewhat!
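
The reinstall sequence described above might look like the sketch below. It is wrapped in a dry-run helper so that nothing runs by accident. The `run` helper, the two function names, the `DRY_RUN` variable, and the default installer filename `NVIDIA-Linux-x86_64-grid.run` are assumptions for illustration, not names taken from this thread; substitute the path of the GRID installer actually downloaded via the Azure documentation.

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed, unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: remove the existing driver, then reboot.
remove_old_driver() {
    run sudo /usr/bin/nvidia-uninstall
    run sudo reboot
}

# Step 2 (after the reboot): switch to runlevel 3 so the X server is not
# running, install the GRID driver from the command line, then check that
# nvidia-smi can see the GPU.
# $1 is the path to the GRID installer (placeholder name by default).
install_grid_driver() {
    installer="${1:-./NVIDIA-Linux-x86_64-grid.run}"
    run sudo init 3
    run sudo sh "$installer"
    run nvidia-smi
}
```

Calling `remove_old_driver` and, after the reboot, `install_grid_driver /path/to/installer` prints the commands for review; setting `DRY_RUN=0` executes them instead.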

The NCas T4 v3 instance seems to be a dual-use VM: though its name begins with NC for compute, it seems to have virtual graphics enabled by default, so the GRID driver has to be used. Is that configurable, so that you can switch it to compute-only in the management interface?