System Not Initialized (ReturnCodes 802 and 83)

Hello,

I’m setting up a new server with several Tesla V100 GPUs running Ubuntu 18.04 LTS. I followed the instructions to install CUDA 10.2 but see “System Not Initialized” errors when I try to run any of the samples. Here is the complete list of what I’ve done:

  1. Downloaded CUDA Toolkit 10.2 for Ubuntu 18.04 from Nvidia’s website (the local .run version).
  2. Ran the installer, installing the packaged drivers (v440.33.01) along with the libraries and binaries. This completed with no errors.
  3. Rebooted the machine.
  4. Blacklisted the Nouveau driver following the Linux install guide recommendations.
  5. Rebooted the machine.
  6. Ran nvidia-smi. This shows the correct driver version and all GPUs.
  7. Built the deviceQuery example included with CUDA. This builds fine.
  8. Tried to run the deviceQuery executable. This gives an error: “cudaGetDeviceCount returned 802 -> system not yet initialized Result = FAIL”
  9. Some googling led me to the Nvidia Developer Forum and then to this page from SuperMicro: https://www.supermicro.com/support/faqs/faq.cfm?faq=31029
  10. Installed the Nvidia fabric manager with no errors.
  11. Started the nvidia-fabricmanager service and was able to run nv-hostengine with no errors.
  12. Rebooted the machine again.
  13. On startup, the kernel reports a driver issue with nvidia-nvswitch0: “Fatal, Link 03 DL LTSSM Fault” but continues to start.
  14. Tried to run deviceQuery again. I get the same error message, but now the function reports returning “83”, which is not a documented CUDA error code.
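
When the same check has to be repeated across reboots or machines, the error line from step 8 can be pulled apart with a few lines of Python. This is a minimal sketch, assuming the deviceQuery output looks like the line quoted in step 8. The function name `parse_device_query_error` and the `KNOWN_CODES` table are illustrative and not part of CUDA. The table lists only code 802, which the CUDA runtime names `cudaErrorSystemNotReady`; 83 is deliberately left unmapped.

```python
import re

# Illustrative lookup table (an assumption, not a CUDA API). Only the code
# named in this thread is listed; anything else is reported as "unknown".
KNOWN_CODES = {802: "cudaErrorSystemNotReady"}

# Matches text such as:
#   cudaGetDeviceCount returned 802 -> system not yet initialized
# The arrow may also appear on the following line.
_ERROR_RE = re.compile(r"(\w+) returned (\d+)\s*->\s*([^\n]+)")


def parse_device_query_error(output):
    """Return (function, code, message, name), or None if no error line is found."""
    match = _ERROR_RE.search(output)
    if match is None:
        return None
    func = match.group(1)
    code = int(match.group(2))
    # Drop a trailing "Result = FAIL" when it sits on the same line.
    message = re.sub(r"\s*Result = \w+\s*$", "", match.group(3)).strip()
    return func, code, message, KNOWN_CODES.get(code, "unknown")


if __name__ == "__main__":
    sample = ("cudaGetDeviceCount returned 802 -> "
              "system not yet initialized Result = FAIL")
    print(parse_device_query_error(sample))
```

Feeding it the captured output of `./deviceQuery` gives the numeric code separately from the message, which makes a change such as 802 becoming 83 easy to spot.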

Some other things I’ve noticed: I have NvSwitch devices in /dev alongside the GPU devices. The NvSwitch devices have default permissions 600. I’ve changed that to 666 based on the guidance in the CUDA install documentation. Restarting the nvidia-fabricmanager service resets the device permissions to 600 every time (I’m not sure if this is desired). Starting the nvidia-fabricmanager service does not start nv-hostengine by default; I have to do that manually. After starting nv-hostengine I can run dcgmi queries. dcgmi discovery -l reports all GPUs but no NvSwitches.
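
To see whether a service restart really resets the device permissions, the mode bits can be printed before and after the restart. A minimal Python sketch follows. The glob pattern `/dev/nvidia-nvswitch*` is an assumption based on the `nvidia-nvswitch0` name in the kernel message above, and `node_modes` is a hypothetical helper name; adjust both to match what `/dev` shows on your system.

```python
import glob
import os
import stat

# Assumed device-node naming, inferred from "nvidia-nvswitch0" in the kernel log.
NVSWITCH_GLOB = "/dev/nvidia-nvswitch*"


def node_modes(paths):
    """Map each path to its permission bits as an octal string such as '600'.

    Paths that do not exist map to None.
    """
    modes = {}
    for path in paths:
        try:
            mode_bits = stat.S_IMODE(os.stat(path).st_mode)
        except FileNotFoundError:
            modes[path] = None
        else:
            modes[path] = format(mode_bits, "03o")
    return modes


if __name__ == "__main__":
    for path, mode in sorted(node_modes(glob.glob(NVSWITCH_GLOB)).items()):
        print(path, mode)
```

Running it once after the chmod and once after restarting nvidia-fabricmanager shows directly whether the nodes went from 666 back to 600.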

I am having the exact same problem with the same symptoms on an HGX-2 installation. Did you find a resolution?

Hi Bill,

I did. We had a physical connection issue with the box. We rotated and reseated each of the GPUs and the error was fixed.

Best,
Jordan

Thanks. That could be very helpful insight.

I am having the same issue as well on a DGX-2. It occurred when I tried to upgrade the existing Nvidia driver to 470.82.01. The detailed error was:

failed to acquire required privileges to access NVSwitch devices. make sure fabric manager has access permissions to required device node files

Can you share your solution to this problem?
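
One way to narrow down the privileges error quoted above is to check whether a given user can read and write the device nodes at all. The sketch below uses Python's `os.access`, which only inspects permissions and does not open the device. The glob pattern and the helper name `access_report` are assumptions for illustration. The result reflects the user running the script, so it is only meaningful when run as the same user as the Fabric Manager service.

```python
import glob
import os

# Assumed device-node naming; change it to match the nodes present in /dev.
NVSWITCH_GLOB = "/dev/nvidia-nvswitch*"


def access_report(paths):
    """Classify each path as 'missing', 'ok' (read and write allowed), or 'denied'."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = "missing"
        elif os.access(path, os.R_OK | os.W_OK):
            report[path] = "ok"
        else:
            report[path] = "denied"
    return report


if __name__ == "__main__":
    nodes = glob.glob(NVSWITCH_GLOB)
    if not nodes:
        print("no device nodes matched", NVSWITCH_GLOB)
    for path, status in sorted(access_report(nodes).items()):
        print(path, status)
```

If every node reports "ok" for that user and the error persists, file permissions are probably not the cause.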

Hello,

My solution was to physically take out and re-seat the GPU cards within the server. My server was a SuperMicro server (not a DGX-2) that had just been shipped, which could have been the cause of the physical issue. Based on the fact that your machine worked before a software update, but not afterwards, I would speculate you have a different issue than what I faced.

Thank you for sharing. I created a related thread here for others who may find a solution to the problem