[HPC-Benchmarks] Discrepancy between A100 PCIe and A100 SXM4

Hi,

We are running HPC-Benchmarks 21.4 on our GPU servers.
There is a pronounced discrepancy in terms of HPL-AI performance, as follows:
4 x A100 PCIe: ~110 TFlops
4 x A100 SXM4: ~485 TFlops

The former result is at least consistent with published benchmark results from Puget Systems:
https://www.pugetsystems.com/labs/hpc/Outstanding-Performance-of-NVIDIA-A100-PCIe-on-HPL-HPL-AI-HPCG-Benchmarks-2149/#HPL-AI

Could you kindly confirm whether such a difference can be attributed to NVLink?

Regards.

Performance results for HPC Benchmarks (i.e., HPL, HPL-AI, HPCG) are highly dependent on system topology.

It’s hard to provide much information without details of each system.

I can say this: HPL-AI is memory bound. Assuming the SXM4 system is a DGX A100 with NVLink's 600 GB/s of GPU-to-GPU bandwidth, and the PCIe system is bottlenecked by PCIe Gen4's 64 GB/s, I'm not surprised by the results.
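
As a rough back-of-envelope check, using only the numbers quoted in this thread (a sanity check on direction and magnitude, not a model of HPL-AI):

    interconnect bandwidth ratio: 600 GB/s (NVLink, per GPU) / 64 GB/s (PCIe Gen4 x16, bidirectional) ~ 9.4x
    observed HPL-AI ratio:        485 TFlops / 110 TFlops ~ 4.4x

An SXM4 result several times higher than the PCIe one is therefore at least directionally consistent with the GPU-to-GPU interconnect being the limiter; the measured ratio need not match the raw bandwidth ratio, since compute, HBM bandwidth, and the PCIe switch topology also factor in.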

Note, I’m not aware of any official testing of PCIe cards with HPC Benchmarks.

Matt, thanks for your comments.

When we deliver clusters to our customers, HPC benchmarks are routinely used to validate the installation.
Unfortunately, not all vendors publish benchmark results. We hope that NVIDIA can at least publish the raw performance numbers for V100/A100, rather than only the speed-up over V100 given in the product white papers. Of course the results will vary from vendor to vendor, but it is still helpful to know that the order of magnitude is correct.
As it stands, a case like this leaves us second-guessing our installation.

Yes, the SXM4 cards are indeed part of an HGX A100 server.
For further clarification, the topology of the PCIe system is as follows:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      PIX     NODE    NODE    SYS     SYS     NODE    NODE    0-23    0
GPU1    PIX      X      NODE    NODE    SYS     SYS     NODE    NODE    0-23    0
GPU2    NODE    NODE     X      PIX     SYS     SYS     NODE    NODE    0-23    0
GPU3    NODE    NODE    PIX      X      SYS     SYS     NODE    NODE    0-23    0
GPU4    SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     24-47   1
GPU5    SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     24-47   1
mlx5_0  NODE    NODE    NODE    NODE    SYS     SYS      X      PIX
mlx5_1  NODE    NODE    NODE    NODE    SYS     SYS     PIX      X

The host has dual Xeon(R) Gold 6342 CPUs.

Including GPU4 and GPU5 provides only a marginal performance gain, perhaps because they need to traverse the inter-socket link (UPI, shown as SYS in the matrix above) to communicate with the other GPUs. But we are not completely sure, given the black-box nature of HPL-AI.
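
One way to sanity-check the cross-socket path independently of HPL-AI is to time a plain device-to-device copy between a GPU pair on the same PCIe switch (GPU0-GPU1, PIX above) and a pair on different sockets (GPU0-GPU4, SYS above). Below is a minimal sketch of such a check; the file name, default GPU indices, 1 GiB transfer size, and repetition count are illustrative choices only, and the p2pBandwidthLatencyTest sample from NVIDIA's cuda-samples does the same measurement more thoroughly.

// p2p_bw_check.cu -- minimal sketch (not part of HPL-AI) to time a
// device-to-device copy between two GPUs. Defaults (GPU0 -> GPU4) follow
// the topology above and are illustrative only.
// Build: nvcc -o p2p_bw_check p2p_bw_check.cu
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s (%s:%d)\n", cudaGetErrorString(e_), __FILE__, __LINE__); \
    exit(1); } } while (0)

int main(int argc, char **argv) {
    int src = (argc > 1) ? atoi(argv[1]) : 0;   // source GPU index
    int dst = (argc > 2) ? atoi(argv[2]) : 4;   // destination GPU index (other socket here)
    const size_t bytes = 1ULL << 30;            // 1 GiB per copy
    const int reps = 10;

    // Report whether a direct peer (P2P) path exists between the two GPUs.
    int peer = 0;
    CHECK(cudaDeviceCanAccessPeer(&peer, src, dst));
    printf("direct peer access GPU%d -> GPU%d: %s\n", src, dst, peer ? "yes" : "no");

    void *src_buf = nullptr, *dst_buf = nullptr;
    CHECK(cudaSetDevice(src));
    CHECK(cudaMalloc(&src_buf, bytes));
    if (peer) CHECK(cudaDeviceEnablePeerAccess(dst, 0));
    CHECK(cudaSetDevice(dst));
    CHECK(cudaMalloc(&dst_buf, bytes));

    // Warm-up copy; cudaMemcpyPeer stages through host memory when direct
    // peer access is not available (e.g. across the socket boundary).
    CHECK(cudaMemcpyPeer(dst_buf, dst, src_buf, src, bytes));
    CHECK(cudaDeviceSynchronize());

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpyPeer(dst_buf, dst, src_buf, src, bytes));
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("GPU%d -> GPU%d: %.1f GB/s over %d x 1 GiB copies\n",
           src, dst, (double)bytes * reps / sec / 1e9, reps);

    CHECK(cudaFree(dst_buf));
    CHECK(cudaSetDevice(src));
    CHECK(cudaFree(src_buf));
    return 0;
}

If the same-switch pair lands near the PCIe Gen4 x16 rate while the cross-socket pair is substantially lower, that would support the interconnect explanation above.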

If you have further insights, please let us know.