Unable to start fabricmanager on Ubuntu 20.04 LTS server with A100

Hello,

I am using an A100 GPU on an Ubuntu 20.04 LTS server, and I have installed NVIDIA driver 510.47.03 for it.

I followed this official document to install the driver:
NVIDIA HGX A100 Software User Guide

However, after the installation finished, running “nvidia-smi” fails with “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”

Besides that, I can’t start nvidia-fabricmanager.

Ubuntu 20.04 LTS server
Tesla A100
510.47.03 driver
CUDA Toolkit 11.6

Bug report: nvidia-bug-report.log.gz (28.6 MB)

I don’t know why my bug report is so large. The decompressed file reaches 800 MB.

Any help is greatly appreciated, thanks!

The logs are flooded with error messages saying that BAR1 isn’t assigned. I can only see that you’re running this in passthrough mode on VMware.
Please enable “Above 4G Decoding” or “large/64-bit BARs” and disable CSM in the BIOS.
Then set up the VM correctly (EFI boot, enough MMIO space):
https://blogs.vmware.com/apps/2018/10/how-to-enable-nvidia-v100-gpu-in-passthrough-mode-on-vsphere-for-machine-learning-and-other-hpc-workloads.html
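
For illustration, here is a rough sketch of the VM-side settings this usually comes down to, written as a small Python helper that prints the advanced VM parameters (.vmx entries). The keys firmware, pciPassthru.use64bitMMIO and pciPassthru.64bitMMIOSizeGB are the ones VMware documents for large-BAR passthrough. The sizing rule (total GPU memory in GB, rounded up to the next power of two) is my reading of the rule of thumb in the linked article, so treat it as an assumption and check current VMware guidance. The GPU count (8) and memory per GPU (40 GB) in the example are also assumptions, since the thread does not state them, and the function names are made up for this sketch.

```python
def next_power_of_two_above(n: int) -> int:
    """Return the smallest power of two strictly greater than n (n >= 1)."""
    return 1 << n.bit_length()


def vmx_passthrough_settings(gpu_count: int, gpu_memory_gb: int) -> dict:
    """Build the advanced VM parameters for GPU passthrough with large BARs.

    Assumed sizing rule: total GPU memory in GB, rounded up to the next
    power of two. Verify this against current VMware guidance.
    """
    if gpu_count < 1 or gpu_memory_gb < 1:
        raise ValueError("gpu_count and gpu_memory_gb must be positive")
    total_gb = gpu_count * gpu_memory_gb
    return {
        "firmware": "efi",                   # VM must boot in EFI mode
        "pciPassthru.use64bitMMIO": "TRUE",  # map BARs above 4 GB
        "pciPassthru.64bitMMIOSizeGB": str(next_power_of_two_above(total_gb)),
    }


def render_vmx(settings: dict) -> str:
    """Format the settings as key = "value" lines, as they appear in a .vmx file."""
    return "\n".join(f'{key} = "{value}"' for key, value in settings.items())


if __name__ == "__main__":
    # Hypothetical example: 8 GPUs with 40 GB each (not stated in the thread).
    print(render_vmx(vmx_passthrough_settings(gpu_count=8, gpu_memory_gb=40)))
```

These values go into the VM's advanced configuration parameters while the VM is powered off. The host BIOS settings mentioned above still have to be changed separately.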
