There is no current plan to formally release the software.

I am making the software available to universities and research centers. Send me an email from your official email address and I will send you the source code.

One should keep in mind that, while something like this might work for your application, using mixed precision violates the rules for an official HPL run.

Well, I tried to send you a message, but I received this error message: "This message can not be sent because the recipient has their personal messenger disabled or they are in a member group not allowed to use the personal messenger."

So, please, can you send me the code from your article?

I’m a passionate amateur who runs many HPL tests on my computers, trying to optimize my machines as much as possible. I currently run it with GotoBLAS on Debian GNU/Linux amd64 (Intel Quad Core CPU). I have two Nvidia 8800GT cards and I would like to test them.

It would be very nice if you could send me the source code for “accelerate Linpack with CUDA”.

First of all, thank you for sharing that wonderful paper about HPL, I really enjoyed reading it.

We currently have a 24-node cluster with Tesla S1070s to evaluate running our Parallel Oil-Water-Gas Reservoir Simulator (POWERS) on GPUs.

At the same time we wanted to benchmark the system using your HPL version of the benchmark.

We set up your CUDA_pinned version of HPL for S1070 (gt200) on our system. We are using exactly the software versions recommended in CUDA_LINPACK_README.txt:

• Openmpi 1.2.5

• Intel Compiler 10.1.015

• Intel MKL 10.0.3

• CUDA 2.2 (cudatoolkit_2.2_linux_64_rhel4.7.run & NVIDIA-Linux-x86_64-185.18.36-pkg2.run)

Everything compiles fine.

The nodes are dual-socket quad-core Nehalem with 12 GB RAM; each node sees 2 GPUs from an S1070.

Runs are with 80% of memory.
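For readers sizing their own runs, here is a minimal sketch of the usual rule of thumb for picking N from a memory budget: the N x N matrix of 8-byte doubles should fit in the chosen fraction of RAM. The function name `hpl_problem_size` and the rounding to a multiple of NB are my own illustration, not something taken from CUDA_LINPACK_README.txt, and the N used in the run below may well have been chosen against a different memory baseline.

```python
import math


def hpl_problem_size(total_mem_bytes, fraction=0.80, nb=None):
    """Largest N such that an N x N matrix of doubles fits in `fraction` of memory.

    Hypothetical helper for illustration only. If `nb` is given, N is rounded
    down to a multiple of the blocking factor (a common habit, not a requirement).
    """
    budget_doubles = int(total_mem_bytes * fraction) // 8  # 8 bytes per double
    n = math.isqrt(budget_doubles)
    if nb:
        n -= n % nb
    return n


if __name__ == "__main__":
    node_mem = 12 * 1024**3  # one 12 GB node, taken as 12 GiB
    print("upper bound on N:", hpl_problem_size(node_mem, 0.80))
    print("rounded to NB=1152:", hpl_problem_size(node_mem, 0.80, nb=1152))
```

For a multi-node run the same estimate would be applied to the total memory of all participating nodes, since the matrix is distributed across the P x Q process grid.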

When running on a node with 8 CPU cores and 2 GPUs (OMP_NUM_THREADS=4):

[chewbacca@superbeast078 CUDA_pinned]$ mpirun -np 2 -hostfile nodes ./run_linpack.CUDA_pinned
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 32032
NB : 1152
PMAP : Row-major process mapping
P : 1
Q : 2
PFACT : Left
NBMIN : 4
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 128)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
Assigning device 0 to process on node superbeast078 rank 0
Assigning device 1 to process on node superbeast078 rank 1
DTRSM split from environment variable 0.630000
DGEMM split from environment variable 0.660000
DTRSM split from environment variable 0.630000
DGEMM split from environment variable 0.660000
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR00L2L4 32032 1152 1 2 122.00 1.796e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0044480 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
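As a sanity check on the result line above, the Gflops figure can be recomputed from N and the wall time using the standard HPL operation count of 2/3 N^3 + 3/2 N^2. This is just a sketch; `hpl_gflops` is a name I made up for illustration.

```python
def hpl_gflops(n, seconds):
    """Gflops for an HPL solve of order n, using the count 2/3*n^3 + 3/2*n^2."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Values taken from the WR00L2L4 result line above.
    print(f"{hpl_gflops(32032, 122.00):.1f} Gflops")
```

With N = 32032 and 122.00 s this comes out at about 179.6 Gflops, which agrees with the 1.796e+02 reported by the run.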

The tests above all run fine. However, as soon as we try running on multiple GPUs across multiple nodes, we get the errors below.

We still get the error even if we set the split as low as 0.50.
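To make clear what the split value means here, a rough sketch follows. It assumes, as I understand the approach, that the DGEMM split is the fraction of the columns of B and C handed to the GPU while the CPU cores compute the rest. The helper names `split_columns` and `balanced_split` are hypothetical and this is not the actual code from the CUDA_pinned build.

```python
def split_columns(n_cols, split):
    """Return (gpu_cols, cpu_cols) for a DGEMM whose C matrix has n_cols columns.

    Assumption: `split` is the fraction of columns sent to the GPU.
    """
    if not 0.0 <= split <= 1.0:
        raise ValueError("split must be between 0 and 1")
    gpu_cols = int(n_cols * split)  # truncate; the remainder goes to the CPU
    return gpu_cols, n_cols - gpu_cols


def balanced_split(gpu_gflops, cpu_gflops):
    """Split at which GPU and CPU would finish together, given sustained rates."""
    return gpu_gflops / (gpu_gflops + cpu_gflops)


if __name__ == "__main__":
    print(split_columns(960, 0.5))     # half of a 960-column update to the GPU
    print(balanced_split(3.0, 1.0))    # a GPU 3x faster than the CPU share
```

Under that assumption, lowering the split only shifts work from the GPU to the CPU; it would not be expected to change whether the device buffers are accessed at all.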

We also tried reducing the value of N from 32032 to as low as 5000 and we still get the same error.

We have 24 nodes in the cluster and we tried the run below on multiple pairs of nodes just to make sure that it’s not a hardware problem, but we get the same error regardless of the pair being used.

I was wondering if you have encountered this before and have any idea how to address it?

Your valuable feedback is highly appreciated.

Thank you for your help.

Sincerely,

Mohamad Sindi

Saudi Aramco

EXPEC Computer Center

High Performance Computing Group

Error:

[sindimo@superbeast078 CUDA_pinned]$ mpirun -np 4 -hostfile nodes ./run_linpack.CUDA_pinned
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 5000
NB : 960
PMAP : Row-major process mapping
P : 2
Q : 2
PFACT : Left
NBMIN : 4
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 128)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
Assigning device 0 to process on node superbeast078 rank 0
Assigning device 1 to process on node superbeast078 rank 2
Assigning device 0 to process on node superbeast079 rank 1
Assigning device 1 to process on node superbeast079 rank 3
DTRSM split from environment variable 0.500000
DTRSM split from environment variable 0.500000
DGEMM split from environment variable 0.500000
DGEMM split from environment variable 0.500000
DTRSM split from environment variable 0.500000
DTRSM split from environment variable 0.500000
DGEMM split from environment variable 0.500000
DGEMM split from environment variable 0.500000
!!!! device access error (write B) 11
!!!! device access error (write C) 11
** On entry to DGEMM parameter number 10 had an illegal value
!!!! device access error (write B) 11
!!!! device access error (write C) 11
** On entry to DGEMM parameter number 10 had an illegal value
[superbeast078:27713] *** Process received signal ***
[superbeast078:27712] *** Process received signal ***
[superbeast078:27713] Signal: Segmentation fault (11)
[superbeast078:27713] Signal code: Invalid permissions (2)
[superbeast078:27713] Failing at address: 0x2aa95c3608
[superbeast079:27114] *** Process received signal ***
[superbeast079:27114] Signal: Segmentation fault (11)
[superbeast079:27114] Signal code: Invalid permissions (2)
[superbeast079:27114] Failing at address: 0x2aaa6307d0
[superbeast079:27115] *** Process received signal ***
[superbeast079:27115] Signal: Segmentation fault (11)
[superbeast079:27115] Signal code: Invalid permissions (2)
[superbeast079:27115] Failing at address: 0x2aa9b5b7d0
./run_linpack.CUDA_pinned: line 14: 27712 Segmentation fault $HPL_DIR/bin/CUDA_pinned/xhpl
./run_linpack.CUDA_pinned: line 14: 27713 Segmentation fault $HPL_DIR/bin/CUDA_pinned/xhpl
./run_linpack.CUDA_pinned: line 14: 27115 Segmentation fault $HPL_DIR/bin/CUDA_pinned/xhpl
./run_linpack.CUDA_pinned: line 14: 27114 Segmentation fault $HPL_DIR/bin/CUDA_pinned/xhpl
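As a reading aid for the "parameter number 10" message in the log above: the number is a 1-based position in the DGEMM argument list. The sketch below maps positions to names using the reference BLAS signature DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC). It only translates the position and says nothing about why the value was rejected; `dgemm_param_name` is a name I made up for illustration.

```python
# Argument order of the reference BLAS routine DGEMM.
DGEMM_ARGS = (
    "TRANSA", "TRANSB", "M", "N", "K", "ALPHA",
    "A", "LDA", "B", "LDB", "BETA", "C", "LDC",
)


def dgemm_param_name(number):
    """Translate the 1-based parameter number from a BLAS error into its name."""
    if not 1 <= number <= len(DGEMM_ARGS):
        raise ValueError("DGEMM has 13 parameters")
    return DGEMM_ARGS[number - 1]


if __name__ == "__main__":
    print(dgemm_param_name(10))
```

Parameter 10 is LDB, the leading dimension of B, so the call that failed was made with a leading dimension that the BLAS argument check considered too small for the B it was given.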

================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 15360
NB : 512
PMAP : Column-major process mapping
P : 1
Q : 1
PFACT : Left Crout Right
NBMIN : 2 4
NDIV : 2
RFACT : Left Crout Right
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC00L2L2 15360 512 1 1 3.21 7.534e+02
HPL ERROR from process # 0, on line 321 of function HPL_pdtest:
>>> Error code returned by solve is 1, skip**** <<<

HPL ERROR from process # 0, on line 321 of function HPL_pdtest:

Hi all, I documented the procedure I followed to get NVIDIA’s HPL working on both S1070 and S2050. I am getting efficiencies similar to what NVIDIA published at their GTC2010 conference for 1-node runs (73% for C1060 and 63% for M2050). Here’s the link for the HOWTO; I hope you find it useful:

Hi Mohamad, how did you obtain the code from Nvidia? We would also like to benchmark our cluster with HPL/CUDA (some nodes carry GPUs), but I am having a very hard time finding out whom to ask at Nvidia.