We’ve recently purchased a generic server with 6 A100 PCIe cards and a dedicated HGX-A100.
The HPL and HPCG results from the NGC images are within the expected range.
When the run is confined to the first NVLink group and CPU socket, the HPL-AI results are as follows:
For 4 A100 PCIe, the performance is ~100 TFLOPS
For 4 A100 SXM4, the performance is ~400 TFLOPS
That is roughly a 4x gap for the same number of GPUs, so I think the PCIe version is grossly underperforming:
- What is the reason for such a large difference?
- How can I debug this issue and improve the performance of the PCIe version for AI applications?
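
As a first debugging step, I was planning to check whether any PCIe card has trained its link at a lower generation or width than it supports, alongside the output of `nvidia-smi topo -m`. Below is a minimal sketch of that check, not a definitive diagnostic. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the helper names (`parse_gpu_csv`, `degraded_links`, `query_gpus`) are my own and hypothetical. Note that GPUs can lower the link generation when idle, so this probably needs to be run while the benchmark is active.

```python
import csv
import io
import subprocess

# Fields requested from `nvidia-smi --query-gpu`; order matters for parsing.
FIELDS = [
    "index", "name",
    "pcie.link.gen.current", "pcie.link.gen.max",
    "pcie.link.width.current", "pcie.link.width.max",
    "clocks.sm", "power.draw", "power.limit",
]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        if not raw:
            continue  # skip blank lines
        values = [v.strip() for v in raw]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def degraded_links(rows):
    """Return indices of GPUs whose current PCIe gen/width is below the max."""
    bad = []
    for r in rows:
        gen_ok = r["pcie.link.gen.current"] == r["pcie.link.gen.max"]
        width_ok = r["pcie.link.width.current"] == r["pcie.link.width.max"]
        if not (gen_ok and width_ok):
            bad.append(r["index"])
    return bad


def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(out.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    for g in gpus:
        print(g)
    print("GPUs with degraded PCIe link:", degraded_links(gpus))
```

The same rows also carry SM clock and power draw versus power limit, which I would compare between the PCIe and SXM4 machines under load, since the two form factors have different power limits.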