Hi, I added an A100X to my server and ran nvidia-smi, but ‘ERR!’ appeared in that GPU’s Pwr:Usage/Cap field.
Here is my lstopo output. Yes, we are using a mix of 5 A100s and 3 A100Xs.
I ran a bus bandwidth test. The Avg. bus bandwidth for the A100X (the one showing ‘ERR!’) paired with an A100 was lower (1.25 GB/s)
than for the other combinations (1.63 GB/s).
Fortunately, after rebooting the server, the ‘ERR!’ disappeared (and the throughput returned to normal).
I have to report this to my employer. Is there a way to find out why I got the ‘ERR!’?
nvidia-bug-report.log.gz (3.3 MB)