Hi, we are running HPC benchmark 21.6 on a standalone machine with four 32-core AMD EPYC CPUs and a single A100, and it prints the output below:
root@e6526e47a8ff:/workspace/hpl-linux-x86_64# ./xhpl HPL-dgx-a100-1N.dat
================================================================================
HPL-NVIDIA 1.0.0 – NVIDIA accelerated HPL benchmark – NVIDIA
HPLinpack 2.1 – High-Performance Linpack benchmark – October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 32032
NB : 288
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 4
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 0
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
******** TESTING SYSTEM PARAMETERS ********
PARAM [UNITS] MIN MAX AVG
----- ------- --- --- ---
CPU :
CPU_BW [GB/s ] 17.0 17.0 17.0
CPU_FP [GFLPS]
NB = 32 56 56 56
NB = 64 106 106 106
NB = 128 188 188 188
NB = 256 242 242 242
NB = 512 337 337 337
PCIE (NVLINK on IBM) :
H2D_BW [GB/s ] 22.5 22.5 22.5
D2H_BW [GB/s ] 24.9 24.9 24.9
BID_BW [GB/s ] 31.4 31.4 31.4
CPU_BW concurrent with BID_BW :
CPU_BW [GB/s ] 14.4 14.4 14.4
BID_BW [GB/s ] 12.6 12.6 12.6
GPU :
GPU_BW [GB/s ] 1295 1295 1295
GPU_FP [GFLPS]
NB = 128 7743 7743 7743
NB = 256 15220 15220 15220
NB = 384 17996 17996 17996
NB = 512 17279 17279 17279
NB = 640 15540 15540 15540
NB = 768 14686 14686 14686
NB = 896 14496 14496 14496
NB = 1024 14559 14559 14559
NET :
PROC COL NET_BW [MB/s ]
8 B 78 78 78
64 B 647 647 647
512 B 4169 4169 4169
4 KB 12416 12416 12416
32 KB 15406 15406 15406
256 KB 25471 25471 25471
2048 KB 12430 12430 12430
16384 KB 11903 11903 11903
NET_LAT [ us ] 0.0 0.0 0.0
PROC ROW NET_BW [MB/s ]
8 B 102 102 102
64 B 763 763 763
512 B 5379 5379 5379
4 KB 24212 24212 24212
32 KB 31025 31025 31025
256 KB 35586 35586 35586
2048 KB 12804 12804 12804
16384 KB 11898 11898 11898
NET_LAT [ us ] 0.0 0.0 0.0
displaying Prog:%complete, N:columns, Time:seconds
iGF:instantaneous GF, GF:avg GF, GF_per: process GF
Per-Process Host Memory Estimate: 8.36 GB (MAX) 8.36 GB (MIN)
PCOL: 0 GPU_COLS: 32033 CPU_COLS: 0
2022-05-06 09:17:36.214
…
T/V N NB P Q Time Gflops
WR00L2L4 32032 288 1 1 16.60 3.020e+03
The measured 3.02 TFLOPS is far below the 19.5 TFLOPS that is announced for the A100. As far as we know, it should reach about 19.5 * 0.8 = 15.6 TFLOPS in theory.
Can anyone tell me why? Thanks.
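For reference, the gap can be put into numbers with a short Python sketch. It only uses figures copied from the log and the post above. The variable names are illustrative, and the 0.8 factor is our own rule-of-thumb assumption, not something reported by the benchmark.

```python
# Figures copied from the post; all variable names are illustrative only.
measured_gflops = 3.020e3    # "Gflops" column of the WR00L2L4 result line
announced_tflops = 19.5      # announced A100 figure quoted in the post
assumed_efficiency = 0.8     # assumption: rule-of-thumb HPL efficiency from the post
n = 32032                    # "N" from the parameter list

measured_tflops = measured_gflops / 1000.0
expected_tflops = announced_tflops * assumed_efficiency

# How far the run is from the announced and from the expected figure.
fraction_of_announced = measured_tflops / announced_tflops
fraction_of_expected = measured_tflops / expected_tflops

# Size of the dense N x N double-precision matrix (8 bytes per element).
matrix_gb = n * n * 8 / 1e9

print(f"measured:              {measured_tflops:.2f} TFLOPS")
print(f"expected (19.5 * 0.8): {expected_tflops:.1f} TFLOPS")
print(f"fraction of announced: {fraction_of_announced:.1%}")
print(f"fraction of expected:  {fraction_of_expected:.1%}")
print(f"matrix size for N={n}: {matrix_gb:.2f} GB")
```

So the run reaches roughly 15% of the announced figure and under 20% of what we expected. The matrix for N = 32032 is about 8.2 GB, which is in line with the 8.36 GB per-process host memory estimate printed in the log.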