A100 PCIe HPL-AI poor performance



We’ve recently purchased a generic server with 6 A100 PCIe cards and a dedicated HGX-A100.
The HPL and HPCG results using the NGC images are within the expected range.

When the run is confined to the first NVLink group and socket, the HPL-AI results are as follows:

For 4 A100 PCIe, the performance is ~100 TFLOPS
For 4 A100 SXM4, the performance is ~400 TFLOPS
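
To make the gap easier to compare, here is a quick per-GPU breakdown. It is only a sketch using the approximate figures above; the variable names are my own and nothing here is measured separately.

```python
# Approximate HPL-AI totals reported in this post, in TFLOPS, for 4 GPUs each.
results_tflops = {"A100 PCIe": 100.0, "A100 SXM4": 400.0}
num_gpus = 4

# Average throughput per GPU for each configuration.
per_gpu = {name: total / num_gpus for name, total in results_tflops.items()}

# How much faster the SXM4 run is than the PCIe run.
ratio = results_tflops["A100 SXM4"] / results_tflops["A100 PCIe"]

for name, value in per_gpu.items():
    print(f"{name}: ~{value:.0f} TFLOPS per GPU")
print(f"SXM4 / PCIe ratio: ~{ratio:.1f}x")
```

So the PCIe cards average roughly a quarter of the per-GPU throughput of the SXM4 cards in this test.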

I think the PCIe version is grossly underperforming, so:

  1. What is the reason for such a large difference?
  2. How can I debug this issue and improve the performance of the PCIe version for AI applications?