We are running HPC Benchmarks 21.4 on our GPU servers.
There is a pronounced discrepancy in HPL-AI performance between two systems, as follows:
4 x A100 PCIe: ~110 TFlops
4 x A100 SXM4: ~485 TFlops
Performance results for HPC Benchmarks (i.e., HPL, HPL-AI, HPCG) are highly dependent on system topology.
It’s hard to provide much information without details of each system.
I can say this: HPL-AI is memory bound. Assuming the SXM4 system is a DGX A100 with NVLink at 600 GB/s and the PCIe system is bottlenecked by Gen4's 64 GB/s, I'm not surprised by the results.
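If you want to sanity-check what the links actually deliver, something like the quick CUDA sketch below (my rough test harness, not part of the HPC Benchmarks suite; the device IDs, transfer size, and repetition count are arbitrary choices) times peer-to-peer copies between two GPUs. Over NVLink you should see several hundred GB/s; over a PCIe Gen4 x16 path, roughly 25 GB/s unidirectional is typical.

```
// Rough sketch: estimate GPU-to-GPU copy bandwidth between two devices.
// Compile with nvcc; pass two device indices, e.g. "./p2p_bw 0 1".
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    int src_dev = argc > 1 ? atoi(argv[1]) : 0;
    int dst_dev = argc > 2 ? atoi(argv[2]) : 1;
    const size_t bytes = 256ull << 20;  // 256 MiB per copy
    const int reps = 20;

    // Enable direct peer access where the hardware allows it;
    // cudaMemcpyPeer otherwise stages the transfer through host memory.
    int can = 0;
    cudaDeviceCanAccessPeer(&can, src_dev, dst_dev);
    void *src, *dst;
    cudaSetDevice(src_dev);
    if (can) cudaDeviceEnablePeerAccess(dst_dev, 0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(dst_dev);
    if (can) cudaDeviceEnablePeerAccess(src_dev, 0);
    cudaMalloc(&dst, bytes);

    cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);  // warm-up
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);
    cudaDeviceSynchronize();
    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
    printf("GPU%d -> GPU%d: %.1f GB/s (peer access %s)\n",
           src_dev, dst_dev, reps * bytes / secs / 1e9,
           can ? "direct" : "staged via host");
    return 0;
}
```

Running it for the same device pair on both systems should make the interconnect gap obvious.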
Note that I'm not aware of any official testing of PCIe cards with HPC Benchmarks.
When we deliver clusters to our customers, HPC benchmarks are routinely used to validate the installation.
Unfortunately, not all vendors publish benchmark results. We hope NVIDIA can at least publish the raw V100/A100 performance numbers instead of only the speed-up over V100 quoted in the product white papers. Of course the results will vary from vendor to vendor, but it would still help to know that the order of magnitude is right.
As it stands, cases like this leave us second-guessing our installation.
Yes, the SXM4 cards are indeed part of an HGX A100 server.
For further clarification, the topology of the PCIe system (nvidia-smi topo -m output) is as follows:
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   mlx5_0  mlx5_1  CPU Affinity  NUMA Affinity
GPU0    X      PIX    NODE   NODE   SYS    SYS    NODE    NODE    0-23          0
GPU1    PIX    X      NODE   NODE   SYS    SYS    NODE    NODE    0-23          0
GPU2    NODE   NODE   X      PIX    SYS    SYS    NODE    NODE    0-23          0
GPU3    NODE   NODE   PIX    X      SYS    SYS    NODE    NODE    0-23          0
GPU4    SYS    SYS    SYS    SYS    X      PIX    SYS     SYS     24-47         1
GPU5    SYS    SYS    SYS    SYS    PIX    X      SYS     SYS     24-47         1
mlx5_0  NODE   NODE   NODE   NODE   SYS    SYS    X       PIX
mlx5_1  NODE   NODE   NODE   NODE   SYS    SYS    PIX     X
The host is a dual-socket Xeon Gold 6342 system.
Including GPU4 and GPU5 provides only a marginal performance gain, perhaps because they have to traverse the inter-socket UPI link (the SYS hops above) to reach the other GPUs. But we are not completely sure, given the black-box nature of HPL-AI.
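One way we could probe this is the short check below (a sketch of ours, not part of the benchmark; it only reports CUDA peer-access capability, not HPL-AI's actual communication path). It prints the pairwise peer-access matrix for comparison against the topology above. On dual-socket PCIe systems the SYS pairs often report no direct peer access, so their traffic is staged through host memory across the socket link, which would be consistent with the marginal gain.

```
// Print which GPU pairs can access each other directly (Y/N),
// to cross-check against the nvidia-smi topology matrix.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("      ");
    for (int j = 0; j < n; ++j) printf("GPU%-3d", j);
    printf("\n");
    for (int i = 0; i < n; ++i) {
        printf("GPU%-3d", i);
        for (int j = 0; j < n; ++j) {
            int ok = 0;
            if (i != j) cudaDeviceCanAccessPeer(&ok, i, j);
            printf("  %c   ", i == j ? 'X' : (ok ? 'Y' : 'N'));
        }
        printf("\n");
    }
    return 0;
}
```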