NVIDIA-SMI shows ERR! in Pwr:Usage/Cap for one of the A100X GPUs

Hi, I added A100X GPUs to my server and ran nvidia-smi, but ‘ERR!’ appeared in the Pwr:Usage/Cap field for one of them.

Here is my lstopo output. Yes, we are using a mix of 5 A100s and 3 A100Xs.

I ran a bus bandwidth test. The pairing of the A100X that showed ‘ERR!’ with an A100 had a lower average bus bandwidth (1.25 GB/s) than the other combinations (1.63 GB/s).

Fortunately, after rebooting the server, the ‘ERR!’ disappeared (and the throughput returned to normal).
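
Since the error cleared on reboot, one way to catch a recurrence is to poll the power readings and record which GPU fails. The following is only a minimal sketch, not an NVIDIA-recommended procedure. It assumes nvidia-smi is on the PATH and uses its `--query-gpu` CSV output. The function names are hypothetical, and because I am not sure what string nvidia-smi prints in CSV mode when a reading fails, the sketch treats any non-numeric `power.draw` value as a failed reading.

```python
import subprocess

# Query index, name, power draw and power limit as bare CSV rows.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def query_power_csv():
    """Run nvidia-smi and return its CSV output as text."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return result.stdout


def find_bad_power_rows(csv_text):
    """Return (index, name, raw_value) for rows whose power.draw is not a number.

    Assumption: a failed sensor read shows up as some non-numeric string,
    so anything float() cannot parse is reported.
    """
    bad = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        index = parts[0]
        draw = parts[-2]                  # second-to-last field is power.draw
        name = ", ".join(parts[1:-2])     # tolerate a comma inside the name
        try:
            float(draw)
        except ValueError:
            bad.append((index, name, draw))
    return bad
```

Running `find_bad_power_rows(query_power_csv())` periodically (for example from cron) and logging any non-empty result with a timestamp would show when the reading fails and on which GPU, which can then be matched against the kernel log for the same time.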

I have to report this to my employer. Is there a way to find out why the ‘ERR!’ occurred?

nvidia-bug-report.log.gz (3.3 MB)