HPL benchmark on A100 (40GB PCIe)

Hi, we are running HPC benchmark 21.6 on a standalone machine with four 32-core AMD EPYC CPUs and a single A100. It prints the following:

root@e6526e47a8ff:/workspace/hpl-linux-x86_64# ./xhpl HPL-dgx-a100-1N.dat

================================================================================
HPL-NVIDIA 1.0.0 – NVIDIA accelerated HPL benchmark – NVIDIA

HPLinpack 2.1 – High-Performance Linpack benchmark – October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 32032
NB : 288
PMAP : Row-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 4
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 0
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words


  • The matrix A is randomly generated for each test.

  • The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

  • The relative machine precision (eps) is taken to be 1.110223e-16

  • Computational tests pass if scaled residuals are less than 16.0

      ******** TESTING SYSTEM PARAMETERS ********
      PARAM   [UNITS]         MIN     MAX     AVG
      -----   -------         ---     ---     ---
    

    CPU :
    CPU_BW  [GB/s ]         17.0    17.0    17.0
    CPU_FP  [GFLPS]
         NB =   32            56      56      56
         NB =   64           106     106     106
         NB =  128           188     188     188
         NB =  256           242     242     242
         NB =  512           337     337     337
    PCIE (NVLINK on IBM) :
    H2D_BW  [GB/s ]         22.5    22.5    22.5
    D2H_BW  [GB/s ]         24.9    24.9    24.9
    BID_BW  [GB/s ]         31.4    31.4    31.4
    CPU_BW concurrent with BID_BW :
    CPU_BW  [GB/s ]         14.4    14.4    14.4
    BID_BW  [GB/s ]         12.6    12.6    12.6
    GPU :
    GPU_BW  [GB/s ]         1295    1295    1295
    GPU_FP  [GFLPS]
         NB =  128          7743    7743    7743
         NB =  256         15220   15220   15220
         NB =  384         17996   17996   17996
         NB =  512         17279   17279   17279
         NB =  640         15540   15540   15540
         NB =  768         14686   14686   14686
         NB =  896         14496   14496   14496
         NB = 1024         14559   14559   14559
    NET :
    PROC COL NET_BW [MB/s ]
                 8 B          78      78      78
                64 B         647     647     647
               512 B        4169    4169    4169
                 4 KB       12416   12416   12416
                32 KB       15406   15406   15406
               256 KB       25471   25471   25471
              2048 KB       12430   12430   12430
             16384 KB       11903   11903   11903
    NET_LAT [ us  ]         0.0     0.0     0.0

    PROC ROW NET_BW [MB/s ]
                 8 B         102     102     102
                64 B         763     763     763
               512 B        5379    5379    5379
                 4 KB       24212   24212   24212
                32 KB       31025   31025   31025
               256 KB       35586   35586   35586
              2048 KB       12804   12804   12804
             16384 KB       11898   11898   11898
    NET_LAT [ us  ]         0.0     0.0     0.0

displaying Prog:%complete, N:columns, Time:seconds
iGF:instantaneous GF, GF:avg GF, GF_per: process GF

Per-Process Host Memory Estimate: 8.36 GB (MAX) 8.36 GB (MIN)

PCOL: 0 GPU_COLS: 32033 CPU_COLS: 0
2022-05-06 09:17:36.214

T/V                N    NB     P     Q               Time                 Gflops
WR00L2L4       32032   288     1     1              16.60              3.020e+03

The measured 3 TFLOPS is far from the advertised 19.5 TFLOPS. As far as we know, it should reach about 19.5 × 0.8 = 15.6 TFLOPS in theory.

So can anyone tell me why? Thanks.

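Your N is too small. A 32032 × 32032 double-precision matrix takes only about 8.2 GB (N² × 8 bytes), a small part of the A100's 40 GB. HPL efficiency rises with N because the O(N³) matrix updates increasingly dominate the O(N²) overheads, so increase N until the matrix fills most of the GPU memory.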
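If it helps, here is a rough sizing sketch in Python for picking a larger N. It is only an estimate under my own assumptions: the helper names `matrix_gb` and `suggest_n` are made up for illustration, "40 GB" is taken as 40e9 bytes, and the 0.9 fill fraction is a guess at how much memory can go to the matrix once the benchmark's own workspace is accounted for. It rounds N down to a multiple of NB so the matrix splits into whole blocks.

```python
import math

BYTES_PER_DOUBLE = 8  # HPL solves the system in double precision


def matrix_gb(n: int) -> float:
    """Size in GB (1e9 bytes) of an n x n double-precision matrix."""
    return BYTES_PER_DOUBLE * n * n / 1e9


def suggest_n(gpu_mem_gb: float, nb: int, fill_fraction: float = 0.9) -> int:
    """Largest multiple of nb whose n x n matrix fits in fill_fraction of memory.

    fill_fraction is an assumed safety margin, not a documented value.
    """
    usable_bytes = gpu_mem_gb * 1e9 * fill_fraction
    n_max = math.isqrt(int(usable_bytes // BYTES_PER_DOUBLE))
    return (n_max // nb) * nb


if __name__ == "__main__":
    print(f"current N=32032 uses {matrix_gb(32032):.2f} GB")
    n = suggest_n(gpu_mem_gb=40, nb=288)
    print(f"suggested N={n} uses {matrix_gb(n):.2f} GB")
```

With these assumptions it suggests N = 66816 for NB = 288, which is about 35.7 GB of matrix. Treat that as a starting point: if the run fails with an out-of-memory error, lower N by a few multiples of NB and try again.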