[HPC-Benchmarks] Discrepancy between A100 PCIe and A100 SXM4

Hi,

We are running HPC-Benchmarks 21.4 on our GPU servers.
There is a pronounced discrepancy in terms of HPL-AI performance, as follows:
4 x A100 PCIe: ~110 TFlops
4 x A100 SXM4: ~485 TFlops

The former result is at least consistent with published benchmark results from Puget Systems:
https://www.pugetsystems.com/labs/hpc/Outstanding-Performance-of-NVIDIA-A100-PCIe-on-HPL-HPL-AI-HPCG-Benchmarks-2149/#HPL-AI

Could you kindly confirm whether such a difference can be attributed to NVLink?

Regards.

Performance results for HPC Benchmarks (i.e., HPL, HPL-AI, HPCG) are highly dependent on system topology.

It’s hard to provide much information without details of each system.

I can say this: HPL-AI is memory bound. Assuming the SXM4 system is a DGX A100 with NVLink's 600 GB/s of GPU-to-GPU bandwidth, and the PCIe system is bottlenecked by PCIe Gen4's 64 GB/s, I'm not surprised by the results.
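
As a rough back-of-envelope check, using only the numbers quoted in this thread (a sanity check on direction and magnitude, not a model of HPL-AI):

    interconnect bandwidth ratio: 600 GB/s (NVLink, per GPU) / 64 GB/s (PCIe Gen4 x16, bidirectional) ~ 9.4x
    observed HPL-AI ratio:        485 TFlops / 110 TFlops ~ 4.4x

An SXM4 result several times higher than the PCIe one is therefore at least directionally consistent with the GPU-to-GPU interconnect being the limiter; the measured ratio need not match the raw bandwidth ratio, since compute, HBM bandwidth, and the PCIe switch topology also factor in.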

Note, I’m not aware of any official testing of PCIe cards with HPC Benchmarks.

Matt, thanks for your comments.

When we deliver clusters to our customers, HPC benchmarks are routinely used to validate the installation.
Unfortunately, not all vendors publish benchmark results. We hope that NVIDIA can at least publish the raw performance numbers for V100/A100, rather than only the speed-up over V100 given in the product white papers. Of course the results will vary from vendor to vendor, but it is still helpful to know that the order of magnitude is correct.
As it stands, a case like this leaves us second-guessing our installation.

Yes, the SXM4 cards are indeed part of an HGX A100 server.
For further clarification, the topology of the PCIe system is as follows:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      PIX     NODE    NODE    SYS     SYS     NODE    NODE    0-23    0
GPU1    PIX      X      NODE    NODE    SYS     SYS     NODE    NODE    0-23    0
GPU2    NODE    NODE     X      PIX     SYS     SYS     NODE    NODE    0-23    0
GPU3    NODE    NODE    PIX      X      SYS     SYS     NODE    NODE    0-23    0
GPU4    SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     24-47   1
GPU5    SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     24-47   1
mlx5_0  NODE    NODE    NODE    NODE    SYS     SYS      X      PIX
mlx5_1  NODE    NODE    NODE    NODE    SYS     SYS     PIX      X

The host has dual Xeon(R) Gold 6342 CPUs.

Including GPU4 and GPU5 provides only a marginal performance gain, perhaps because they need to traverse the inter-socket link (UPI, shown as SYS in the matrix above) to communicate with the other GPUs. But we are not completely sure, given the black-box nature of HPL-AI.
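
One way to sanity-check the cross-socket path independently of HPL-AI is to time a plain device-to-device copy between a GPU pair on the same PCIe switch (GPU0-GPU1, PIX above) and a pair on different sockets (GPU0-GPU4, SYS above). Below is a minimal sketch of such a check; the file name, default GPU indices, 1 GiB transfer size, and repetition count are illustrative choices only, and the p2pBandwidthLatencyTest sample from NVIDIA's cuda-samples does the same measurement more thoroughly.

// p2p_bw_check.cu -- minimal sketch (not part of HPL-AI) to time a
// device-to-device copy between two GPUs. Defaults (GPU0 -> GPU4) follow
// the topology above and are illustrative only.
// Build: nvcc -o p2p_bw_check p2p_bw_check.cu
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s (%s:%d)\n", cudaGetErrorString(e_), __FILE__, __LINE__); \
    exit(1); } } while (0)

int main(int argc, char **argv) {
    int src = (argc > 1) ? atoi(argv[1]) : 0;   // source GPU index
    int dst = (argc > 2) ? atoi(argv[2]) : 4;   // destination GPU index (other socket here)
    const size_t bytes = 1ULL << 30;            // 1 GiB per copy
    const int reps = 10;

    // Report whether a direct peer (P2P) path exists between the two GPUs.
    int peer = 0;
    CHECK(cudaDeviceCanAccessPeer(&peer, src, dst));
    printf("direct peer access GPU%d -> GPU%d: %s\n", src, dst, peer ? "yes" : "no");

    void *src_buf = nullptr, *dst_buf = nullptr;
    CHECK(cudaSetDevice(src));
    CHECK(cudaMalloc(&src_buf, bytes));
    if (peer) CHECK(cudaDeviceEnablePeerAccess(dst, 0));
    CHECK(cudaSetDevice(dst));
    CHECK(cudaMalloc(&dst_buf, bytes));

    // Warm-up copy; cudaMemcpyPeer stages through host memory when direct
    // peer access is not available (e.g. across the socket boundary).
    CHECK(cudaMemcpyPeer(dst_buf, dst, src_buf, src, bytes));
    CHECK(cudaDeviceSynchronize());

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpyPeer(dst_buf, dst, src_buf, src, bytes));
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("GPU%d -> GPU%d: %.1f GB/s over %d x 1 GiB copies\n",
           src, dst, (double)bytes * reps / sec / 1e9, reps);

    CHECK(cudaFree(dst_buf));
    CHECK(cudaSetDevice(src));
    CHECK(cudaFree(src_buf));
    return 0;
}

If the same-switch pair lands near the PCIe Gen4 x16 rate while the cross-socket pair is substantially lower, that would support the interconnect explanation above.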

If you have further insights, please let us know.