HPL

Massimiliano Fatica, do you plan on ever releasing the source code that goes along with this paper: http://portal.acm.org/citation.cfm?id=1513…FTOKEN=21112885

It would be very useful for our project, as we use a modified HPL in our application.

Also, have you considered trying out a mixed precision approach to HPL like found here: http://icl.cs.utk.edu/projectsfiles/iter-r…t_cs_06_580.pdf ?

We aren’t looking for a general-purpose library. We’d be glad to put together a hack job until a formal GPU Linpack library comes out.

Ben

There is no current plan to formally release the software.

I am making the software available to universities and research centers. Send me an email from your official email address and I will send you the source code.

Massimiliano

One should keep in mind that, while something like this might work for your application, using mixed precision violates the rules for an official HPL run.
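For context, the mixed-precision approach in the linked report is iterative refinement: factor and solve cheaply in single precision, then correct the solution using double-precision residuals. Here is a minimal pure-Python sketch of the idea (single precision is simulated by rounding every operation with `struct`; this is an illustration, not the paper’s implementation):

```python
import struct

def f32(x):
    """Round a double to the nearest IEEE-754 single, simulating a 32-bit FPU."""
    return struct.unpack('f', struct.pack('f', x))[0]

def solve_f32(A, b):
    """Gaussian elimination with partial pivoting, every operation rounded to single."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented working copy
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # partial pivoting
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(f * M[k][j]))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution, still in "single"
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x

def refine(A, b, sweeps=3):
    """Mixed-precision iterative refinement: low-precision solves,
    double-precision residuals driving the correction."""
    x = solve_f32(A, b)
    for _ in range(sweeps):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))  # residual in double
             for row, bi in zip(A, b)]
        d = solve_f32(A, r)                                 # correction in single
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

For well-conditioned systems a few sweeps recover full double-precision accuracy even though all the O(n³) work is done in single precision, which is exactly why the technique is attractive on single-precision-only GPUs, and also why it does not count as an official HPL run.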

Well, I tried to send you a message, but I received this error: “This message can not be sent because the recipient has their personal messenger disabled or they are in a member group not allowed to use the personal messenger.”

So, please, could you send me the code from your article?

Regards

Fabien

Hello Mr Fatica

I’m a passionate amateur who runs many HPL tests on my computers, trying to optimize my machines as best I can. I currently run it with GotoBLAS on Debian GNU/Linux amd64 (Intel Quad Core CPU). I have two Nvidia 8800GT cards and would like to test them.

It would be very nice if you could send me the source code to “accelerate Linpack with CUDA”.

Best regards.

Even if you had the source, it would do you no good: HPL is a double precision benchmark and your cards don’t support double precision.

Heck, I had not thought about that. I had only checked whether my cards supported CUDA. Never mind, thanks.

Dear Dr. Fatica,

First of all, thank you for sharing that wonderful paper about HPL, I really enjoyed reading it.

We currently have a 24-node cluster with Tesla S1070s to evaluate running our Parallel Oil-Water-Gas Reservoir Simulator (POWERS) on GPUs.

At the same time we wanted to benchmark the system using your HPL version of the benchmark.

We set up your CUDA_pinned version of HPL for the S1070 (GT200) on our system; we are using the exact software versions recommended in CUDA_LINPACK_README.txt:

• Openmpi 1.2.5

• Intel Compiler 10.1.015

• Intel MKL 10.0.3

• CUDA 2.2 (cudatoolkit_2.2_linux_64_rhel4.7.run & NVIDIA-Linux-x86_64-185.18.36-pkg2.run)

Everything compiles fine.

The nodes are dual socket Quad Core Nehalem with 12 GB RAM, each node sees 2 GPUs coming from an S1070.

Runs use 80% of node memory.
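As a side note on the 80% figure: the HPL matrix is N² doubles, so a workable N for a given memory budget can be estimated as sqrt(fraction × memory / 8), usually rounded to keep things friendly to the block size. A quick sketch (`hpl_n` is a made-up helper name; the run below actually uses a somewhat smaller N, leaving headroom for the OS and pinned buffers):

```python
def hpl_n(bytes_per_node, nodes=1, fraction=0.8, nb=1152):
    """Largest N whose N*N*8-byte double-precision matrix fits in the given
    fraction of aggregate memory, rounded down to a multiple of NB."""
    n = int((fraction * bytes_per_node * nodes / 8) ** 0.5)
    return (n // nb) * nb

# One 12 GB node at 80% gives an upper bound around 35k:
print(hpl_n(12 * 1024**3))
```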

When running on a node with 8 CPU cores and 2 GPUs (OMP_NUM_THREADS=4):

[chewbacca@superbeast078 CUDA_pinned]$ mpirun -np 2 -hostfile nodes ./run_linpack.CUDA_pinned

================================================================================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

================================================================================

An explanation of the input/output parameters follows:

T/V    : Wall time / encoded variant.

N      : The order of the coefficient matrix A.

NB     : The partitioning blocking factor.

P      : The number of process rows.

Q      : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   32032 

NB     :    1152 

PMAP   : Row-major process mapping

P      :       1 

Q      :       2 

PFACT  :    Left 

NBMIN  :       4 

NDIV   :       2 

RFACT  :    Left 

BCAST  :   1ring 

DEPTH  :       0 

SWAP   : Mix (threshold = 128)

L1     : transposed form

U      : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be               1.110223e-16

- Computational tests pass if scaled residuals are less than                16.0

Assigning device 0  to process on node superbeast078 rank 0 

Assigning device 1  to process on node superbeast078 rank 1 

DTRSM split from environment variable 0.630000 

DGEMM split from environment variable 0.660000 

DTRSM split from environment variable 0.630000 

DGEMM split from environment variable 0.660000 

================================================================================

T/V                N    NB     P     Q               Time                 Gflops

--------------------------------------------------------------------------------

WR00L2L4       32032  1152     1     2             122.00              1.796e+02

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0044480 ...... PASSED

================================================================================

Finished      1 tests with the following results:

              1 tests completed and passed residual checks,

              0 tests completed and failed residual checks,

              0 tests skipped because of illegal input values.

--------------------------------------------------------------------------------

End of Tests.

================================================================================

Rpeak (theoretical) = 93 GFLOPS (8 CPU cores) + 154 GFLOPS (2 GPUs) = 247 GFLOPS

Rmax (actual) = 179.6 GFLOPS (73% efficiency)
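On the DGEMM/DTRSM split lines printed in the log above: the environment variable apparently sets the fraction of each update offloaded to the GPU, with the remainder handled by the host BLAS. The idea can be sketched by splitting a matrix product column-wise (a plain-Python stand-in for illustration; the actual partitioning in the CUDA code may differ):

```python
def matmul(A, B):
    """Naive reference GEMM on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def split_gemm(A, B, split=0.66):
    """Compute C = A*B by giving the first `split` fraction of B's columns to
    one worker (the "GPU" share) and the rest to another (the "CPU" share),
    then stitching the partial results back together."""
    cut = int(split * len(B[0]))
    B_gpu = [row[:cut] for row in B]   # would be sent to the GPU BLAS
    B_cpu = [row[cut:] for row in B]   # would stay with the host BLAS
    C_gpu = matmul(A, B_gpu)
    C_cpu = matmul(A, B_cpu)
    return [g + c for g, c in zip(C_gpu, C_cpu)]
```

With the two partial products running concurrently, tuning the split to match the relative CPU and GPU DGEMM speeds is what balances the hybrid run.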

So far the above tests run fine; however, as soon as we try running on multiple GPUs across multiple nodes, we get the errors below.

We still get the error even if we set the split as low as 0.50.

We also tried reducing N from 32032 to as low as 5000, and we still get the same error.

We have 24 nodes in the cluster and tried the run below on multiple pairs of nodes, just to make sure it is not a hardware problem, but we get the same error regardless of the pair used.

Have you encountered this before, and do you have any idea how to address it?

Your feedback is highly appreciated.

Thank you for your help.

Sincerely,

Mohamad Sindi

Saudi Aramco

EXPEC Computer Center

High Performance Computing Group

Error:

[sindimo@superbeast078 CUDA_pinned]$ mpirun -np 4 -hostfile nodes ./run_linpack.CUDA_pinned

================================================================================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

================================================================================

An explanation of the input/output parameters follows:

T/V    : Wall time / encoded variant.

N      : The order of the coefficient matrix A.

NB     : The partitioning blocking factor.

P      : The number of process rows.

Q      : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    5000 

NB     :     960 

PMAP   : Row-major process mapping

P      :       2 

Q      :       2 

PFACT  :    Left 

NBMIN  :       4 

NDIV   :       2 

RFACT  :    Left 

BCAST  :   1ring 

DEPTH  :       0 

SWAP   : Mix (threshold = 128)

L1     : transposed form

U      : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be               1.110223e-16

- Computational tests pass if scaled residuals are less than                16.0

Assigning device 0  to process on node superbeast078 rank 0 

Assigning device 1  to process on node superbeast078 rank 2 

Assigning device 0  to process on node superbeast079 rank 1 

Assigning device 1  to process on node superbeast079 rank 3 

DTRSM split from environment variable 0.500000 

DTRSM split from environment variable 0.500000 

DGEMM split from environment variable 0.500000 

DGEMM split from environment variable 0.500000 

DTRSM split from environment variable 0.500000 

DTRSM split from environment variable 0.500000 

DGEMM split from environment variable 0.500000 

DGEMM split from environment variable 0.500000 

!!!! device access error (write B) 11

!!!! device access error (write C) 11

 ** On entry to DGEMM  parameter number 10 had an illegal value

!!!! device access error (write B) 11

!!!! device access error (write C) 11

 ** On entry to DGEMM  parameter number 10 had an illegal value

[superbeast078:27713] *** Process received signal ***

[superbeast078:27712] *** Process received signal ***

[superbeast078:27713] Signal: Segmentation fault (11)

[superbeast078:27713] Signal code: Invalid permissions (2)

[superbeast078:27713] Failing at address: 0x2aa95c3608

[superbeast079:27114] *** Process received signal ***

[superbeast079:27114] Signal: Segmentation fault (11)

[superbeast079:27114] Signal code: Invalid permissions (2)

[superbeast079:27114] Failing at address: 0x2aaa6307d0

[superbeast079:27115] *** Process received signal ***

[superbeast079:27115] Signal: Segmentation fault (11)

[superbeast079:27115] Signal code: Invalid permissions (2)

[superbeast079:27115] Failing at address: 0x2aa9b5b7d0

./run_linpack.CUDA_pinned: line 14: 27712 Segmentation fault      $HPL_DIR/bin/CUDA_pinned/xhpl

./run_linpack.CUDA_pinned: line 14: 27713 Segmentation fault      $HPL_DIR/bin/CUDA_pinned/xhpl

./run_linpack.CUDA_pinned: line 14: 27115 Segmentation fault      $HPL_DIR/bin/CUDA_pinned/xhpl

./run_linpack.CUDA_pinned: line 14: 27114 Segmentation fault      $HPL_DIR/bin/CUDA_pinned/xhpl
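Incidentally, the “Assigning device” lines in both runs are consistent with each MPI rank picking a device by its local index among the ranks on its own node. A small sketch that reproduces the mapping from the log above (my reconstruction for illustration, not the actual code):

```python
def assign_devices(ranks):
    """Map each (rank, hostname) pair to a device index equal to the rank's
    local position among the ranks on the same host."""
    per_host = {}
    devices = {}
    for rank, host in sorted(ranks):
        devices[rank] = per_host.get(host, 0)   # next free device on this host
        per_host[host] = devices[rank] + 1
    return devices

# The 4-rank run above: ranks 0 and 2 on superbeast078, ranks 1 and 3 on superbeast079.
layout = [(0, "superbeast078"), (1, "superbeast079"),
          (2, "superbeast078"), (3, "superbeast079")]
print(assign_devices(layout))  # {0: 0, 1: 0, 2: 1, 3: 1}
```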

I have a similar issue, but even worse: I cannot even run on 1 GPU.

================================================================================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

================================================================================

An explanation of the input/output parameters follows:

T/V    : Wall time / encoded variant.

N      : The order of the coefficient matrix A.

NB     : The partitioning blocking factor.

P      : The number of process rows.

Q      : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   25000 

NB     :     384      512      640      768      896      960     1024     1152 

PMAP   : Row-major process mapping

P      :       1 

Q      :       1 

PFACT  :    Left 

NBMIN  :       2 

NDIV   :       2 

RFACT  :    Left 

BCAST  :   1ring 

DEPTH  :       1 

SWAP   : Mix (threshold = 192)

L1     : no-transposed form

U      : no-transposed form

EQUIL  : yes

ALIGN  : 8 double precision words


- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be               1.110223e-16

- Computational tests pass if scaled residuals are less than                16.0

[node001:14923] *** Process received signal ***

./run_linpack: line 17: 14923 Segmentation fault $HPL_DIR/bin/CUDA_pinned/xhpl


mpirun has exited due to process rank 0 with PID 14922 on

node node001 exiting without calling "finalize". This may

have caused other processes in the application to be

terminated by signals sent by mpirun (as reported here).


Could anyone help with this?

Thanks in advance.

I have a problem when running HPL on a GPU cluster.

================================================================================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

================================================================================

An explanation of the input/output parameters follows:

T/V    : Wall time / encoded variant.

N      : The order of the coefficient matrix A.

NB     : The partitioning blocking factor.

P      : The number of process rows.

Q      : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   15360 

NB     :     512 

PMAP   : Column-major process mapping

P      :       1 

Q      :       1 

PFACT  :    Left    Crout    Right 

NBMIN  :       2        4 

NDIV   :       2 

RFACT  :    Left    Crout    Right 

BCAST  :   1ring 

DEPTH  :       0 

SWAP   : Mix (threshold = 64)

L1     : transposed form

U      : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be               1.110223e-16

- Computational tests pass if scaled residuals are less than                16.0

================================================================================

T/V                N    NB     P     Q               Time                 Gflops

--------------------------------------------------------------------------------

WC00L2L2       15360   512     1     1               3.21              7.534e+02

HPL ERROR from process # 0, on line 321 of function HPL_pdtest:

>>> Error code returned by solve is 1, skip**** <<<

HPL ERROR from process # 0, on line 321 of function HPL_pdtest:

>>> Error code returned by solve is 1, skip**** <<<

How can I solve this problem?

Hi all, I documented the procedure I used to get NVIDIA’s HPL working on both the S1070 and the S2050. I am getting efficiencies similar to what NVIDIA published at their GTC 2010 conference for single-node runs (73% for the C1060 and 63% for the M2050). Here’s the link to the HOWTO; I hope you find it useful:

HOWTO - HPL on GPU

Thank you

Mohamad Sindi

Saudi Aramco

EXPEC Computer Center

High Performance Computing Group

Hi Mohamad, how did you obtain the code from Nvidia? We would also like to benchmark our cluster with HPL/CUDA (some of our nodes carry GPUs), but I am having a very hard time finding out whom to ask at Nvidia.

Regards

Michael