System Not Initialized (ReturnCodes 802 and 83)

jordan.cheney · March 30, 2020, 10:45pm

Hello,

I’m setting up a new server with several Tesla V100 GPUs, running Ubuntu 18.04 LTS. I’ve followed the instructions to install CUDA 10.2 but am seeing “System Not Initialized” errors when I try to run any of the samples. Here is the complete list of what I’ve done:

1. Downloaded CUDA Toolkit 10.2 for Ubuntu 18.04 from Nvidia’s website (the local .run version).
2. Ran the installer, installing the packaged drivers (v440.33.01) along with the libraries and binaries. This completed with no errors.
3. Rebooted the machine.
4. Blacklisted the Nouveau driver, following the Linux install guide recommendations.
5. Rebooted the machine.
6. Ran nvidia-smi. This shows the correct driver version and all GPUs.
7. Built the deviceQuery example included with CUDA. This builds fine.
8. Tried to run the deviceQuery executable. This gives an error: “cudaGetDeviceCount returned 802 → system not yet initialized Result = FAIL”
9. Some googling led me to the Nvidia Developer Forum and then to this page from SuperMicro: FAQ Entry | Online Support | Support - Super Micro Computer, Inc.
10. Installed the Nvidia fabric manager with no errors.
11. Started the nvidia-fabricmanager service and was able to run nv-hostengine with no errors.
12. Rebooted the machine again.
13. On startup, the kernel reports a driver issue with nvidia-nvswitch0: “Fatal, Link 03 DL LTSSM Fault”, but continues to start.
14. Tried to run deviceQuery again and got the same error message, but now the function reports returning “83”, which is not a documented CUDA error code.
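
For anyone wanting to check their own boot log for the same kind of kernel message, here is a minimal sketch. It assumes the log text has already been captured (for example, by saving the output of dmesg to a file) and that fault lines contain the device name followed by "Fatal," as in the message quoted above. The function name find_nvswitch_faults and the exact line layout are assumptions for illustration, not anything provided by the driver.

```python
import re

# Assumption: a fault line names the device ("nvidia-nvswitch<N>") and then
# contains "Fatal, <message>", mirroring the message quoted in this post.
FAULT_RE = re.compile(r"(nvidia-nvswitch\d+).*?Fatal,\s*(.*)")


def find_nvswitch_faults(log_text):
    """Return (device, message) pairs for NVSwitch fatal lines in log_text."""
    faults = []
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faults.append((match.group(1), match.group(2).strip()))
    return faults


if __name__ == "__main__":
    # Hypothetical sample line built from the message quoted in the post.
    sample = "nvidia-nvswitch0: Fatal, Link 03 DL LTSSM Fault\n"
    for device, message in find_nvswitch_faults(sample):
        print(device, "->", message)
```

A link fault reported this early in boot may point at hardware rather than software, so it is worth checking before reinstalling drivers.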

Some other things I’ve noticed: I have NvSwitch devices in /dev alongside the GPU devices. The NvSwitch devices have default permissions 600. I’ve changed that to 666 based on the guidance in the CUDA install documentation. Restarting the nvidia-fabricmanager service resets the device permissions to 600 every time (I’m not sure if this is desired). Starting the nvidia-fabricmanager service does not start nv-hostengine by default; I have to do that manually. After starting nv-hostengine I can run dcgmi queries. dcgmi discovery -l reports all GPUs but no NvSwitches.
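
To watch whether the permissions get reset after a service restart, a small script like the following can list the current modes. This is only a sketch: the glob pattern nvidia-nvswitch* is an assumption based on the device name in the kernel message (nvidia-nvswitch0), and device_modes is a hypothetical helper name.

```python
import glob
import os
import stat


def device_modes(dev_dir="/dev", pattern="nvidia-nvswitch*"):
    """Map each matching node name to its permission bits as an octal string.

    The default pattern is an assumption; adjust it to match the node names
    actually present on your system.
    """
    modes = {}
    for path in sorted(glob.glob(os.path.join(dev_dir, pattern))):
        # S_IMODE strips the file-type bits and leaves only the permissions.
        mode = stat.S_IMODE(os.stat(path).st_mode)
        modes[os.path.basename(path)] = format(mode, "03o")
    return modes


if __name__ == "__main__":
    for name, mode in device_modes().items():
        print(name, mode)
```

Running it before and after restarting nvidia-fabricmanager shows whether the service changed the modes.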

bill.whiteley · May 11, 2020, 4:54pm

I am having the exact same problem with the same symptoms on an HGX-2 installation. Did you find a resolution?

jordan.cheney · May 11, 2020, 8:05pm

Hi Bill,

I did. We had a physical connection issue with the box. We rotated and reseated each of the GPUs, and the error was fixed.

Best,
Jordan

bill.whiteley · May 11, 2020, 9:31pm

Thanks. That could be very helpful insight.

caohoangtung2001 · January 22, 2022, 5:16am

I am having the same issue as well on a DGX-2. It occurred when I tried to upgrade the existing NVIDIA driver to 470.82.01. The detailed error was:

failed to acquire required privileges to access NVSwitch devices. make sure fabric manager has access permissions to required device node files

Can you share your solution to this problem?

jordan.cheney · January 22, 2022, 5:25am

Hello,

My solution was to physically take out and re-seat the GPU cards within the server. My server was a SuperMicro server (not a DGX-2) that had just been shipped, which could have been the cause of the physical issue. Based on the fact that your machine worked before a software update, but not afterwards, I would speculate you have a different issue than what I faced.

caohoangtung2001 · January 22, 2022, 5:50am

Thank you for sharing. I created a related thread here for others who may find a solution to the problem