I’ve been benchmarking some new Sun Fire X4600 M2 servers with HPL, using our normal software stack of PGI (with its bundled MPICH and ACML) on RHEL 4.6, and the performance is awful.
The hardware is 8 × Opteron 8356 (quad-core 2.3 GHz Barcelona) with 64GB RAM, which gives an Rpeak of 294.4 GFLOPS (32 cores × 2.3 GHz × 4 FLOPs/cycle).
With PGI 7.2, we get an Rmax of 58.18 GFLOPS.
I tried a different compiler and MPI, Sun Studio Express 07/08 and Sun HPC ClusterTools 8.0 EA2, and Rmax with the same input file is 169.6 GFLOPS.
Compiler options were:
PGI: -tp barcelona-64 -fastsse -O3 -Munroll
Studio: -fast -xtarget=barcelona -m64 -xvector=simd
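For context, the input file sizes the problem to use most of the 64GB across all 32 cores; the key HPL.dat lines look something like this (values illustrative, not our exact file):
    80000        Ns       (~ sqrt(0.8 * 64e9 / 8), so the matrix fills ~80% of RAM)
    168          NBs      (block size; typical tuned values for Barcelona)
    4            Ps
    8            Qs       (P x Q = 32, one rank per core)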
Any suggestions?
Hmm, possibly it’s the MPI implementation, as the run time and CPU utilization are wildly different across processes:
top - 16:33:34 up 14 days, 16 min, 2 users, load average: 25.06, 25.54, 25.91
Tasks: 455 total, 26 running, 429 sleeping, 0 stopped, 0 zombie
Cpu(s): 71.9% us, 3.9% sy, 0.0% ni, 23.7% id, 0.0% wa, 0.0% hi, 0.5% si
Mem: 65910064k total, 55307004k used, 10603060k free, 272532k buffers
Swap: 20482832k total, 0k used, 20482832k free, 3703764k cached
PID USER PR NI %CPU TIME+ %MEM VIRT RES SHR S COMMAND
8731 atg 25 0 100 84:20.77 2.4 1603m 1.5g 1780 R xhpl
8623 atg 23 0 100 80:26.38 2.4 1564m 1.5g 1780 R xhpl
8839 atg 23 0 100 81:33.91 2.4 1564m 1.5g 1780 R xhpl
9055 atg 23 0 100 79:20.34 2.4 1564m 1.5g 1780 R xhpl
9136 atg 16 0 99 72:53.32 2.4 1607m 1.5g 1780 R xhpl
9271 atg 23 0 99 83:59.40 2.4 1564m 1.5g 1780 R xhpl
8503 atg 25 0 99 82:25.96 2.4 1602m 1.5g 1780 R xhpl
8785 atg 19 0 98 73:19.73 2.4 1598m 1.5g 1780 R xhpl
9163 atg 25 0 98 62:56.78 2.4 1597m 1.5g 1780 S xhpl
8812 atg 17 0 97 83:03.70 2.4 1574m 1.5g 1800 R xhpl
8758 atg 18 0 97 82:03.88 2.4 1598m 1.5g 1780 R xhpl
8677 atg 16 0 90 75:13.70 2.4 1572m 1.5g 1780 S xhpl
9244 atg 17 0 89 87:08.42 2.4 1574m 1.5g 1796 R xhpl
8893 atg 16 0 86 72:39.59 2.4 1572m 1.5g 1780 R xhpl
8920 atg 16 0 85 66:57.44 2.4 1596m 1.5g 1780 S xhpl
9001 atg 19 0 84 76:36.71 2.4 1599m 1.5g 1780 R xhpl
8496 atg 16 0 84 71:25.74 2.4 1597m 1.5g 1868 S xhpl
8596 atg 17 0 81 66:07.80 2.4 1574m 1.5g 1800 R xhpl
9325 atg 16 0 73 76:16.28 2.4 1572m 1.5g 1780 S xhpl
8704 atg 16 0 72 71:16.70 2.4 1606m 1.5g 1780 R xhpl
9190 atg 18 0 71 77:12.88 2.4 1599m 1.5g 1780 R xhpl
8947 atg 16 0 69 62:06.32 2.4 1596m 1.5g 1780 S xhpl
9109 atg 16 0 69 73:57.95 2.4 1572m 1.5g 1780 S xhpl
9028 atg 17 0 67 68:42.16 2.4 1574m 1.5g 1800 R xhpl
8569 atg 18 0 62 72:27.98 2.4 1598m 1.5g 1780 R xhpl
8974 atg 18 0 61 81:05.97 2.4 1599m 1.5g 1780 R xhpl
9217 atg 19 0 60 76:22.10 2.4 1599m 1.5g 1780 R xhpl
8534 atg 18 0 58 77:05.74 2.4 1598m 1.5g 1780 R xhpl
9298 atg 17 0 25 75:44.63 2.4 1569m 1.5g 1780 R xhpl
8866 atg 17 0 24 79:14.78 2.4 1568m 1.5g 1780 R xhpl
8650 atg 16 0 24 80:27.81 2.4 1568m 1.5g 1780 R xhpl
9082 atg 17 0 18 76:55.07 2.4 1569m 1.5g 1780 R xhpl
9876 atg 16 0 1 0:00.51 0.0 8732 1540 932 R top
9365 root 15 0 0 0:34.35 0.0 35396 11m 2608 S X
9699 gdm 16 0 0 0:10.12 0.0 124m 11m 6940 S gdmgreeter
1 root 16 0 0 0:02.71 0.0 4756 556 456 S init
2 root RT 0 0 0:00.13 0.0 0 0 0 S migration/0
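A quick way to see whether ranks are being bounced between cores is to watch the CPU each process last ran on (the psr column); run this a few times and see if the values keep changing:
    ps -eo pid,psr,pcpu,comm | grep xhpl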
Hi gormanly,
I would definitely try a different MPI implementation. The MPICH we ship uses basic TCP/IP and is meant for portability. For high-performance applications, I would use the MPI recommended by your interconnect vendor, or one that is optimized for your interconnect.
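It can also be worth pinning each rank to a core once you switch; with Open MPI, something along these lines (the MCA parameter is illustrative of the era's syntax, check your version's docs):
    mpirun -np 32 --mca mpi_paffinity_alone 1 ./xhpl
On an 8-socket NUMA box like the X4600, letting the scheduler migrate ranks between sockets can easily cost you the kind of imbalance your top output shows.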
It’s not just an MPI problem, though that is part of it: further testing has given me
GFLOPS   MPI              Compiler        OS
 58.18   MPICH 1.2.7      PGI 7.2         RHEL 4
 98.83   OpenMPI 1.2.6    PGI 7.2         RHEL 4
122.6    OpenMPI 1.2.5    Studio 12       Solaris 10
123.4    OpenMPI 1.3pre   Studio 12       RHEL 4
169.6    OpenMPI 1.3pre   Studio Express  RHEL 4
with the same input file.
Hi gormanly,
The vast majority of HPL’s time is spent in the math library (specifically DGEMM), and the compiler has very little to do with the overall performance. Hence, you should next focus on the math library used. It appears to me that Sun has a very good parallel math library. Can you try linking against it in the PGI build? What happens if you use ACML with the Studio build? Do the ATLAS or GotoBLAS libraries help?
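Swapping the BLAS in HPL is just a matter of pointing the LAdir/LAlib lines of your Make.<arch> at the library you want, roughly like this (paths illustrative):
    LAdir        = /opt/acml/pgi64            # ACML with the PGI build
    LAlib        = $(LAdir)/lib/libacml.a
    # or, for the Sun Performance Library with the Studio build:
    LAlib        = -xlic_lib=sunperf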
On a side note, this interview with PGI’s director Doug Miles, posted on HPCwire (http://www.hpcwire.com/features/17886034.html), might be of interest.