Issue when upgrading cuda driver to R470 - DGX2

I am trying to upgrade the NVIDIA driver on my DGX-2 system from R450 to R470. I followed the DGX OS 5 User Guide (DGX OS 5 User Guide :: DGX Systems Documentation).
After rebooting the machine, nvidia-smi shows that the new driver is installed. However, I cannot use CUDA in my PyTorch code; it produces this error:

>>> import torch
>>> torch.cuda.is_available()
/home/tungch/anaconda3/lib/python3.9/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
False

I also ran deviceQuery from the CUDA samples, and it gave me the following error:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 802
-> system not yet initialized
Result = FAIL

I investigated further on the NVIDIA forums, and the problem seems to be with nvidia-fabricmanager on my machine. When I tried to start the service with sudo systemctl start nvidia-fabricmanager, it did not start and produced this error:

nv-fabricmanager[114573]: failed to acquire required privileges to access NVSwitch devices. make sure fabric manager has access permissions to required device node files

I searched but could not find a related issue with a solution. How can I resolve this?

Thank you.

Can you check that you have all the required packages installed? Specifically, is the 470 version of the nvidia-fabricmanager package installed? Could you post the first few lines of the output from nvidia-smi?

To get a list of all packages starting with the name nvidia, run:
apt list nvidia-*
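
Because that listing can be long on a system with several driver generations available, it may help to filter it down to the packages that are actually installed. The following is a minimal sketch, assuming the usual `apt list` line layout (`name/suites version arch [status]`); the function name `installed_packages` is made up for illustration, and the sample lines are the ones quoted later in this thread.

```python
import re

# One line of `apt list` output, for example:
# nvidia-fabricmanager-470/focal-updates,now 470.82.01-0ubuntu0.20.04.1 amd64 [installed]
APT_LINE = re.compile(
    r"^(?P<name>[^/\s]+)/\S+\s+(?P<version>\S+)\s+(?P<arch>\S+)"
    r"(?:\s+\[(?P<status>[^\]]*)\])?$"
)

def installed_packages(apt_list_output):
    """Return {package_name: version} for lines whose status includes 'installed'."""
    result = {}
    for line in apt_list_output.splitlines():
        match = APT_LINE.match(line.strip())
        if match is None:
            continue  # skips the "Listing..." banner and blank lines
        status = match.group("status") or ""
        if "installed" in status.split(","):
            result[match.group("name")] = match.group("version")
    return result

# Sample lines taken from this thread.
SAMPLE = """Listing... Done
nvidia-utils-470-server/focal-updates,focal-security,now 470.82.01-0ubuntu0.20.04.2 amd64 [installed,automatic]
nvidia-fabricmanager-470/focal-updates,now 470.82.01-0ubuntu0.20.04.1 amd64 [installed]
"""

for name, version in installed_packages(SAMPLE).items():
    print(name, version)
```

On a real system the text would come from running `apt list nvidia-*` and capturing its standard output.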

You could also try uninstalling and re-installing the 470 driver again. Note that the instructions include a step to uninstall the driver first as a workaround to an issue.

Did you also upgrade the kernel at the same time? The driver packages depend on the kernel version, so if, for example, you installed the drivers first and then the kernel, there could be a mismatch.

Here are the first few lines of the nvidia-smi output. The driver was installed successfully:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
... 7 other gpus
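
For comparing the loaded driver against the installed packages, the version numbers can be pulled out of that header line. This is only a sketch that assumes the header layout shown above; `parse_smi_header` is a hypothetical helper, not part of any NVIDIA tool.

```python
import re

# Matches the banner line of the default nvidia-smi table.
HEADER = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>\d+(?:\.\d+)*)\s+"
    r"Driver Version:\s+(?P<driver>\d+(?:\.\d+)*)\s+"
    r"CUDA Version:\s+(?P<cuda>\d+(?:\.\d+)*|N/A)"
)

def parse_smi_header(text):
    """Return a dict with 'smi', 'driver' and 'cuda' versions, or None if no header is found."""
    match = HEADER.search(text)
    return match.groupdict() if match else None

# Header line copied from the output above.
LINE = "| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |"
print(parse_smi_header(LINE))
```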

There are many packages starting with nvidia, for example nvidia-utils 450, 455, 460… The installed version is nvidia-utils-470-server:

nvidia-utils-470-server/focal-updates,focal-security,now 470.82.01-0ubuntu0.20.04.2 amd64 [installed,automatic]

I did try uninstalling and re-installing the 470 driver, following the DGX OS 5 User Guide linked above. I uninstalled all the existing driver and CUDA packages, reinstalled everything, and rebooted. It did not solve the issue.

As for the kernel, I also followed the DGX OS 5 User Guide: I ran sudo apt install -y linux-generic first and then installed the required NVIDIA driver packages.

Sorry, I have to ask: you do have the 470 version of the fabric driver installed, correct? It should be the nvidia-fabric-470 package.

Yes, I have. Running apt list nvidia-fabric* gives me a list of fabricmanager versions, and the installed one is nvidia-fabricmanager-470:

nvidia-fabricmanager-470/focal-updates,now 470.82.01-0ubuntu0.20.04.1 amd64 [installed]

I’m trying to find out more about the error, but it could take some time. Also feel free to reach out to customer support.

Thank you. I will update the status if the problem is resolved.

I don’t really understand why you cannot start the fabric manager service. I got this additional information from our engineering team:

This error indicates a permission issue when opening the NVSwitch device. All access/IOCTL is controlled by special fabric management devices.
The actual management devices depend on the driver mechanism: /dev/nvidia-caps/nvidia-capX for devfs and /proc/driver/nvidia-nvlink/capabilities/fabric-mgmt for procfs.
These entries are created by the driver, and the default user is set to root. (The Fabric Manager typically runs in root context to access them.)
In this case, some default settings or permissions seem to have changed on the machine.

Is the FM running as root? The FM systemd service unit file sets the default User=root entry.
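
One way to check that entry is to read the [Service] section of the unit file. The sketch below is an assumption-laden illustration: `service_user` is a hypothetical helper, the sample unit text is made up and is not the real Fabric Manager unit file, and drop-in overrides are ignored. It relies on the systemd behavior that a system service without a User= setting runs as root.

```python
def service_user(unit_text):
    """Return the User= value from the [Service] section, or 'root' if unset."""
    user = "root"  # systemd runs system services as root when User= is absent
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "User":
                # A later assignment overrides an earlier one; empty resets.
                user = value.strip() or "root"
    return user

# Hypothetical unit text for illustration only.
EXAMPLE_UNIT = """[Unit]
Description=Example service

[Service]
User=root
ExecStart=/usr/bin/example-daemon
"""

print(service_user(EXAMPLE_UNIT))
```

On a live system, `systemctl show -p User nvidia-fabricmanager` reports the value systemd actually resolved, including any drop-ins.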

Can you also check these files?

ls -l /dev/nvidia*
ls -l /dev/nvidia-caps*
ls -l /proc/driver/nvidia-nvlink/capabilities/
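
The same information can be collected as text rather than as a screenshot. This is a rough sketch using only the Python standard library; the glob patterns mirror the three commands above, the helper names `describe` and `list_nodes` are made up, and the capability entries only exist on a machine where the driver has created them.

```python
import glob
import os
import stat

# Patterns mirroring the ls commands above.
PATTERNS = [
    "/dev/nvidia*",
    "/dev/nvidia-caps/*",
    "/proc/driver/nvidia-nvlink/capabilities/*",
]

def describe(path):
    """Return the ls-style mode string and numeric owner/group of one path."""
    st = os.stat(path)
    return {
        "path": path,
        "mode": stat.filemode(st.st_mode),
        "uid": st.st_uid,
        "gid": st.st_gid,
    }

def list_nodes(patterns=PATTERNS):
    """Describe every path matching the given glob patterns."""
    found = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            found.append(describe(path))
    return found

for entry in list_nodes():
    # Flag anything not owned by root (uid 0), the default mentioned above.
    note = "" if entry["uid"] == 0 else "  <- not owned by root"
    print(entry["mode"], entry["uid"], entry["gid"], entry["path"] + note)
```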

I did try to run nvidia-fabricmanager as root (the actual root user, not via sudo), and the same error occurred.

Here are the listings you requested (posted as images):
ls -l /dev/nvidia*
image

ls -l /dev/nvidia-caps*
image

ls -l /proc/driver/nvidia-nvlink/capabilities/
image

Does that mean there are some problems with nvidia-caps*?

It doesn’t seem to be the /dev/nvidia-caps directory. I’m not sure it is a SW issue, actually.

You could check the log file again, but it might just show the same information we already know:

sudo journalctl -u nvidia-fabricmanager

Another option could be to install the older 450 driver.

Yes, the log just showed that the fabricmanager service stopped, which didn’t help a lot.

I did try to reinstall the older 450 driver, following the DGX-2 instructions in the guide linked above exactly, and the same problem occurred even after I rebooted or manually started the fabricmanager service.

Is there anything recorded in the syslog for the fabric manager? My fear is that it could be a HW issue. I have not seen anything reported about this issue.

Running less /var/log/syslog shows the following error related to fabricmanager:

fabric manager NVIDIA GPU driver interface version 450.156.00 don't match with driver version 450.80.02. Please update with matching NVIDIA driver package

(The driver was 450 because I tried to reinstall the older R450 driver)

Does this mean I have to install the exact fabricmanager version of 450.80.02? How can that be achieved?

Yes, the versions need to match. All these drivers are compiled with the current kernel, and kernel headers are ‘hashed’ and validated with the drivers.
For DGX, we have specific apt configurations (repositories and preferences) to ensure the correct packages are installed.
Have you made any changes or manually installed only some packages? Typically, apt update; apt upgrade installs the latest and compatible versions.
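
To make the matching requirement concrete, the two versions can be extracted from the syslog message and compared against the package version that apt reports. The following is a sketch under assumptions: the message wording is taken from the syslog line quoted in this thread, the helper names are hypothetical, and the package version is assumed to follow the Debian `[epoch:]upstream[-revision]` layout.

```python
import re

# Wording taken from the syslog line quoted earlier in the thread.
MISMATCH = re.compile(
    r"interface version (?P<fm>\d+(?:\.\d+)*) "
    r"don['’]t match with driver version (?P<driver>\d+(?:\.\d+)*)"
)

def parse_mismatch(log_line):
    """Return (fabric_manager_version, driver_version), or None if the line does not match."""
    match = MISMATCH.search(log_line)
    return (match.group("fm"), match.group("driver")) if match else None

def upstream_version(package_version):
    """Strip a Debian epoch and revision, e.g. '450.80.02-1' becomes '450.80.02'."""
    without_epoch = package_version.split(":", 1)[-1]
    return without_epoch.rsplit("-", 1)[0]

def versions_match(driver_version, fm_package_version):
    """True when the Fabric Manager package carries the same upstream version as the driver."""
    return upstream_version(fm_package_version) == driver_version

LOG = ("fabric manager NVIDIA GPU driver interface version 450.156.00 don't match "
       "with driver version 450.80.02. Please update with matching NVIDIA driver package")

fm_version, driver_version = parse_mismatch(LOG)
print(fm_version, driver_version, fm_version == driver_version)
```

A matching version pair is necessary for the Fabric Manager to start, but as the rest of this thread shows, it does not by itself rule out the permission error.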

I reinstalled the matching fabricmanager version, but the first error still occurred:

nv-fabricmanager[1755846]: failed to acquire required privileges to access NVSwitch devices. make sure fabric manager has access permissions to required device node files

The driver version is 450.80.02 (shown by nvidia-smi), and the nvidia-fabricmanager version is 450.80.02-1 (shown by apt list --installed nvidia-fabricmanager* -a).

The versions look correct. Let me reach out to you by DM.