Hello,
I’m setting up a new server with several Tesla V100 GPUs, running Ubuntu 18.04 LTS. I’ve followed the instructions to install CUDA 10.2, but I’m seeing “System Not Initialized” errors when I try to run any of the samples. Here is the complete list of what I’ve done:
- Downloaded CUDA Toolkit 10.2 for Ubuntu 18.04 from Nvidia’s website (the local .run version)
- Ran the installer, installing the packaged driver (v440.33.01) along with the libraries and binaries. This completed with no errors
- Rebooted the machine
- Blacklisted the Nouveau driver following the Linux install guide recommendations
- Rebooted the machine
- Ran nvidia-smi. This shows the correct driver version and all GPUs
- Built the deviceQuery example included with CUDA; this builds fine
- Tried to run the deviceQuery executable; this gives an error: “cudaGetDeviceCount returned 802 -> system not yet initialized Result = FAIL”
- Some googling led me to the Nvidia Developer Forum and then to this page from SuperMicro: FAQ Entry | Online Support | Support - Super Micro Computer, Inc.
- I installed the Nvidia fabric manager with no errors
- I started the nvidia fabric manager service and was able to run nv-hostengine with no errors
- I rebooted the machine again
- On startup, the kernel reports a driver issue with nvidia-nvswitch0: “Fatal, Link 03 DL LTSSM Fault”, but the machine continues to boot
- Tried to run deviceQuery again and got the same error message, but now the function reports returning “83”, which is not a documented CUDA error code
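
To make the two return codes above easier to compare, here is a minimal Python sketch that pulls the numeric code out of deviceQuery's output and looks it up in a small table. The table holds only a handful of CUDA runtime error codes and is an assumption for illustration, not a complete list; the function name `parse_device_query_error` is hypothetical. Anything missing from the table (such as the 83 reported after the last reboot) is simply flagged as unknown.

```python
import re

# Partial lookup of CUDA runtime error codes (illustrative, not exhaustive).
KNOWN_CUDA_ERRORS = {
    0: "cudaSuccess",
    100: "cudaErrorNoDevice",
    802: "cudaErrorSystemNotReady",
    803: "cudaErrorSystemDriverMismatch",
    999: "cudaErrorUnknown",
}


def parse_device_query_error(output):
    """Return (code, name) for the first 'cudaGetDeviceCount returned N'
    found in the text, or None if no such line is present."""
    match = re.search(r"cudaGetDeviceCount returned (\d+)", output)
    if match is None:
        return None
    code = int(match.group(1))
    # Codes missing from the partial table are reported as such.
    return code, KNOWN_CUDA_ERRORS.get(code, "not in this table")


if __name__ == "__main__":
    sample = (
        "cudaGetDeviceCount returned 802\n"
        "-> system not yet initialized\n"
        "Result = FAIL"
    )
    print(parse_device_query_error(sample))
```

Code 802 corresponds to `cudaErrorSystemNotReady`, which matches the “system not yet initialized” text printed by deviceQuery.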
Some other things I’ve noticed: I have NvSwitch devices in /dev alongside the GPU devices. The NvSwitch devices have default permissions of 600; I’ve changed that to 666 based on the guidance in the CUDA install documentation. Restarting the nvidia-fabricmanager service resets the device permissions to 600 every time (I’m not sure if this is desired). Starting the nvidia-fabricmanager service does not start nv-hostengine by default; I have to do that manually. After starting nv-hostengine I can run dcgmi queries. dcgmi discovery -l reports all GPUs but no NvSwitches.
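
Since the permissions keep getting reset, a quick way to see the current state of the device nodes is a short Python sketch like the one below. It assumes the GPU and NvSwitch nodes all match the glob `/dev/nvidia*`, which is a guess based on the names seen here. The helper names `device_modes` and `not_world_accessible` are hypothetical.

```python
import glob
import os
import stat


def device_modes(pattern="/dev/nvidia*"):
    """Map each path matching the pattern to its permission bits,
    formatted as a three-digit octal string such as '600'."""
    modes = {}
    for path in sorted(glob.glob(pattern)):
        mode = stat.S_IMODE(os.stat(path).st_mode)
        modes[path] = format(mode, "03o")
    return modes


def not_world_accessible(modes):
    """Return the paths where the 'other' read and write bits
    are not both set (for example 600 rather than 666)."""
    return [p for p, m in modes.items() if int(m, 8) & 0o006 != 0o006]


if __name__ == "__main__":
    modes = device_modes()  # the default pattern is an assumption
    for path, mode in modes.items():
        print(path, mode)
    print("not read/write for all users:", not_world_accessible(modes))
```

Running this before and after restarting the nvidia-fabricmanager service should show whether the NvSwitch nodes drop back to 600.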