Terrible HPL performance

I’ve been benchmarking some new Sun Fire X4600 M2 servers with HPL, using our normal software stack of PGI (with its bundled MPICH and ACML) on RHEL 4.6, and the performance is awful.

The hardware is 8* Opteron 8356 with 64GB RAM, which has an Rpeak of 294.4 GFLOPS.
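
(That Rpeak works out as 8 sockets × 4 cores/socket × 2.3 GHz × 4 double-precision flops/cycle = 294.4 GFLOPS.)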

With PGI 7.2, we get an Rmax of 58.18 GFLOPS.

I tried out a different compiler and MPI, Sun Studio Express 07/08 and Sun HPC ClusterTools 8.0 EA2, and Rmax with the same input file is 169.6 GFLOPS.

Compiler options were:
PGI: -tp barcelona-64 -fastsse -O3 -Munroll
Studio: -fast -xtarget=barcelona -m64 -xvector=simd

Any suggestions?

Hmm, possibly it’s the MPI implementation, as the run times and CPU utilization are wildly different between processes (a quick affinity check is sketched after the top output below):

top - 16:33:34 up 14 days, 16 min,  2 users,  load average: 25.06, 25.54, 25.91
Tasks: 455 total,  26 running, 429 sleeping,   0 stopped,   0 zombie
Cpu(s): 71.9% us,  3.9% sy,  0.0% ni, 23.7% id,  0.0% wa,  0.0% hi,  0.5% si
Mem:  65910064k total, 55307004k used, 10603060k free,   272532k buffers
Swap: 20482832k total,        0k used, 20482832k free,  3703764k cached

  PID USER      PR  NI %CPU    TIME+  %MEM  VIRT  RES  SHR S COMMAND            
 8731 atg       25   0  100  84:20.77  2.4 1603m 1.5g 1780 R xhpl               
 8623 atg       23   0  100  80:26.38  2.4 1564m 1.5g 1780 R xhpl               
 8839 atg       23   0  100  81:33.91  2.4 1564m 1.5g 1780 R xhpl               
 9055 atg       23   0  100  79:20.34  2.4 1564m 1.5g 1780 R xhpl               
 9136 atg       16   0   99  72:53.32  2.4 1607m 1.5g 1780 R xhpl               
 9271 atg       23   0   99  83:59.40  2.4 1564m 1.5g 1780 R xhpl               
 8503 atg       25   0   99  82:25.96  2.4 1602m 1.5g 1780 R xhpl               
 8785 atg       19   0   98  73:19.73  2.4 1598m 1.5g 1780 R xhpl               
 9163 atg       25   0   98  62:56.78  2.4 1597m 1.5g 1780 S xhpl               
 8812 atg       17   0   97  83:03.70  2.4 1574m 1.5g 1800 R xhpl               
 8758 atg       18   0   97  82:03.88  2.4 1598m 1.5g 1780 R xhpl               
 8677 atg       16   0   90  75:13.70  2.4 1572m 1.5g 1780 S xhpl               
 9244 atg       17   0   89  87:08.42  2.4 1574m 1.5g 1796 R xhpl               
 8893 atg       16   0   86  72:39.59  2.4 1572m 1.5g 1780 R xhpl               
 8920 atg       16   0   85  66:57.44  2.4 1596m 1.5g 1780 S xhpl               
 9001 atg       19   0   84  76:36.71  2.4 1599m 1.5g 1780 R xhpl               
 8496 atg       16   0   84  71:25.74  2.4 1597m 1.5g 1868 S xhpl               
 8596 atg       17   0   81  66:07.80  2.4 1574m 1.5g 1800 R xhpl               
 9325 atg       16   0   73  76:16.28  2.4 1572m 1.5g 1780 S xhpl               
 8704 atg       16   0   72  71:16.70  2.4 1606m 1.5g 1780 R xhpl               
 9190 atg       18   0   71  77:12.88  2.4 1599m 1.5g 1780 R xhpl               
 8947 atg       16   0   69  62:06.32  2.4 1596m 1.5g 1780 S xhpl               
 9109 atg       16   0   69  73:57.95  2.4 1572m 1.5g 1780 S xhpl               
 9028 atg       17   0   67  68:42.16  2.4 1574m 1.5g 1800 R xhpl               
 8569 atg       18   0   62  72:27.98  2.4 1598m 1.5g 1780 R xhpl               
 8974 atg       18   0   61  81:05.97  2.4 1599m 1.5g 1780 R xhpl               
 9217 atg       19   0   60  76:22.10  2.4 1599m 1.5g 1780 R xhpl               
 8534 atg       18   0   58  77:05.74  2.4 1598m 1.5g 1780 R xhpl               
 9298 atg       17   0   25  75:44.63  2.4 1569m 1.5g 1780 R xhpl               
 8866 atg       17   0   24  79:14.78  2.4 1568m 1.5g 1780 R xhpl               
 8650 atg       16   0   24  80:27.81  2.4 1568m 1.5g 1780 R xhpl               
 9082 atg       17   0   18  76:55.07  2.4 1569m 1.5g 1780 R xhpl               
 9876 atg       16   0    1   0:00.51  0.0  8732 1540  932 R top                
 9365 root      15   0    0   0:34.35  0.0 35396  11m 2608 S X                  
 9699 gdm       16   0    0   0:10.12  0.0  124m  11m 6940 S gdmgreeter         
    1 root      16   0    0   0:02.71  0.0  4756  556  456 S init               
    2 root      RT   0    0   0:00.13  0.0     0    0    0 S migration/0
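
One thing I still need to check is whether the xhpl ranks are being pinned to cores at all; as far as I can tell the bundled MPICH sets no affinity by default, which could explain the uneven utilization. A rough, untested way to inspect the current bindings on Linux (process names and user as in the top output above):

# List the CPU affinity mask of every xhpl process
for pid in $(pgrep -u atg xhpl); do
    printf '%s: ' "$pid"
    taskset -p "$pid"    # e.g. "pid 8731's current affinity mask: ffffffff"
done
# An all-f mask means the scheduler is free to migrate ranks between sockets.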

Hi gormanly,

I would definitely try a different MPI implementation. The MPICH we ship uses basic TCP/IP and is meant for portability. For high performance applications, I would use the MPI recommended by your interconnect vendor, or one that is optimized for your interconnect.
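
For example, since all 32 ranks are on one box, something along these lines with Open MPI should keep the traffic in shared memory and bind one rank per core (just a sketch; the install path is an example and the MCA parameter names assume Open MPI 1.2 or later):

# Rebuild xhpl against Open MPI, then launch with the shared-memory
# transport and processor affinity enabled:
/opt/openmpi/bin/mpirun -np 32 \
    --mca btl self,sm \
    --mca mpi_paffinity_alone 1 \
    ./xhpl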

  • Mat

It’s not just an MPI problem, but that is part of it: further testing has given me

GFLOPS         MPI              compiler         OS

 58.18         MPICH 1.2.7      PGI 7.2          RHEL 4
 98.83         OpenMPI 1.2.6    PGI 7.2          RHEL 4
122.6          OpenMPI 1.2.5    Studio 12        Solaris 10
123.4          OpenMPI 1.3pre   Studio 12        RHEL 4
169.6          OpenMPI 1.3pre   Studio Express   RHEL 4

with the same input file.

Hi gormanly,

The vast majority of HPL’s time is spent in the math library (specifically DGEMM), and the compiler has very little to do with the overall performance. Hence, you should focus next on the math library used. It appears to me that Sun has a very good parallel math library. Can you try linking it with the PGI version? What happens if you use ACML with Sun? Do the ATLAS or Goto BLAS libraries help?
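
For reference, swapping the BLAS only means changing the math library lines in HPL’s Make.<arch>; a rough sketch comparing ACML with GotoBLAS under PGI (the paths are just examples, adjust to your installs):

# Math library section of HPL's Make.<arch>
# PGI + ACML: use the single-threaded libacml, since HPL runs one MPI rank
# per core (the OpenMP build, libacml_mp, would oversubscribe the cores)
LAdir        = /opt/acml/pgi64/lib
LAinc        =
LAlib        = -L$(LAdir) -lacml

# PGI + GotoBLAS, for comparison
# LAdir      = /opt/gotoblas
# LAlib      = -L$(LAdir) -lgoto -lpthread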

On a side note, this hpcwire.com interview with PGI’s director Doug Miles might be of interest: http://www.hpcwire.com/features/17886034.html

  • Mat