Hi,
We are encountering a very peculiar under-performance of an H200nv_8 node in the HPL-MxP benchmark.
Since the performance of other HPC benchmarks, e.g. HPL/HPCG/GROMACS/LAMMPS, is in line with published data, we strongly believe that there might be a bug in the latest version of the NGC HPC-Benchmarks container.
Spec:
- CPU: 2x Intel Xeon (8558)
- GPU: 8x H200-SXM5
- OS: CentOS 7
- Kernel: 3.10.0-1160.el7.x86_64
Topology:
-
Steps to reproduce:
- Container version: nvcr.io/nvidia/hpc-benchmarks:24.09
- Script:
    singularity \
        run --nv \
        ./hpc-benchmarks_24.09.sif \
        mpirun \
        --np 1 \
        /workspace/hpl-mxp.sh \
        --gpu-affinity 0-23 \
        --cpu-affinity 0 \
        --mem-affinity 0 \
        --nprow 1 \
        --npcol 1 \
        --nporder 0 \
        --n 120000 \
        --nb 2048
- Output:
    ****** HPL MxP Result ******
    EPS . . . . . . . . . . . . . . . . . = 2.000000E-16
    Threshold . . . . . . . . . . . . . . . . . = 1.600000E+01
    ||Ax-b||_oo . . . . . . . . . . . . . . . . . = 6.577394E-14
    ||A ||_oo . . . . . . . . . . . . . . . . . = 1.208605E+05
    ||x ||_oo . . . . . . . . . . . . . . . . . = 1.674189E-05
    ||b ||_oo . . . . . . . . . . . . . . . . . = 9.999956E-01
    ||Ax-b||_oo / (EPS * (||A||_oo * ||x||_oo + ||b||_oo) * N) = 9.064482E-04 ...... PASSED
    N = 120000, NB = 2048, NPROW = 1, NPCOL = 1, SLOPPY-TYPE = 2
    GFLOPS = 5.0041e+04, per GPU = 50041.11
    LU GFLOPS = 4.5437e+05, per GPU = 454368.84
    ****** HPL MxP Result ******
Other information:
- The same 24.09 container gave 50 TFLOPS with the HPL benchmark, so we can rule out a hardware issue with our H200s.
- We achieved ~350 TFLOPS on a GH200 node using similar parameters, and the H200 is expected to perform comparably.
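To put the gap in numbers, here is a rough sanity check. It is only a sketch: the two GFLOPS values are copied from the HPL-MxP output above, the 350 is our approximate GH200 figure, and the function name mxp_summary is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: summarize an HPL-MxP result against a reference.
#   $1 = overall GFLOPS from the HPL-MxP output
#   $2 = LU GFLOPS from the HPL-MxP output
#   $3 = reference result in TFLOPS (here: our approximate GH200 figure)
mxp_summary() {
    awk -v h="$1" -v lu="$2" -v ref="$3" 'BEGIN {
        # GFLOPS -> TFLOPS
        printf "H200 HPL-MxP: %.2f TFLOPS\n", h / 1000
        printf "H200 LU only: %.2f TFLOPS\n", lu / 1000
        # How many times faster the reference is than the overall result
        printf "GH200 / H200: %.1fx\n", ref / (h / 1000)
    }'
}

# Values from the single-GPU run reported above.
mxp_summary 5.0041e+04 4.5437e+05 350
```

With these inputs the overall result is about 50 TFLOPS, roughly 7x below what we measured on the GH200, while the LU-only figure is about 454 TFLOPS.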
I am not aware of any dedicated forum for NGC. Please kindly move this post to a more appropriate one if you deem it necessary.
Regards.