HPLinpack for CUDA Any interest?

In a very early foray into double precision about a year ago, I wrote a really nasty little wrapper function around CUBLAS dgemm and built and ran a couple of HPLinpack tests. It worked, but the results were less than spectacular, so I didn’t pursue it further, other promising myself I would do it"properly" for CUDA, when I had some free time. Massimo Fatica’s great paper on HPLinpack and CUDA only added further to my interest. I noticed from time to time there have been naïve posts here contain enquiries about running Linpack with CUDA (most recently this one,) but it seems nobody actually got all that far.

During the poor weather we “enjoyed” over Christmas, I decided to do a bit of instrumentation and profiling of the v2.0 of the netlib portable implementation and see where the low handing fruit is in terms of offloading parts of the calculation onto the GPU and trying to amortise PCI-e bus overheads wherever practicable. At the moment I have only offloaded the post panel factorization update code, so that the GPU executes dtrsm(), then host and device dgemm() calls are overlapped, each working on some fraction of the total columns in the update region. I think I have gone about as far as I can go in that direction, and I am beginning to get vaguely respectable looking numbers, but I wonder whether anyone else has actually tried this, and what their approach looks like? Most importantly, what sort of computational efficiency numbers have been achieved?

At the moment, about the best I can do on an 3.0GHz AMD Phenom II with a stock 896Mb GTX275 using GotoBlas2 and CUBLAS 2.3 looks something like this:

============================================================

====================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

============================================================

====================

An explanation of the input/output parameters follows:

T/V	: Wall time / encoded variant.

N	  : The order of the coefficient matrix A.

NB	 : The partitioning blocking factor.

P	  : The number of process rows.

Q	  : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N	  :   16000 

NB	 :	1024 

PMAP   : Column-major process mapping

P	  :	   1 

Q	  :	   1 

PFACT  :   Crout 

NBMIN  :	   8 

NDIV   :	   2 

RFACT  :   Crout 

BCAST  :  1ringM 

DEPTH  :	   0 

SWAP   : Mix (threshold = 192)

L1	 : transposed form

U	  : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

	  ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be			   1.110223e-16

- Computational tests pass if scaled residuals are less than				16.0

============================================================

====================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WC01C2C8	   16000  1024	 1	 1			  43.88			  6.223e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0030391 ...... PASSED

============================================================

====================

Finished	  1 tests with the following results:

			  1 tests completed and passed residual checks,

			  0 tests completed and failed residual checks,

			  0 tests skipped because of illegal input values.

--------------------------------------------------------------------------------

End of Tests.

============================================================

====================

That is only a shade over 50% computational efficiency, which isn’t all that fabulous. Memory size on the GPU seems to be a reasonable bottleneck to improving things, especially with the dgemm() call, because of the shape of the panel matrices and the total FLOPs in the operation, relative to the memory footprint of the product matrix.

Anyone else?

Looks interesting, a colleague of mine has a box with 3 Tesla cards… any chance to get your wrapper-sources to run some tests?

Unfortunately what I am using and what I described above isn’t a set of wrappers.

It is a reasonably invasive patch which makes some fairly large changes to a couple of HPLinpack source files, adds some new support code, requires modified makefiles, etc. Getting it into a redistributable form someone else could build would not be a small task. Then there is the issue of tuning. I have written some “Atlas like”, for want of a better expression, self-tuning routines which generate data for some python code to crunch down into a set of tuning factors for a given PCI-e bus, host cpu and card. It also would require a lot of work to get into shape for someone else to use.

The other question (with respect to the sort of multi gpu box you are talking about) is how well this adapts to multi-gpu. I have a single cpu, multi-gpu box and have tested it, but the scalability is poor. It is primarily a host BLAS based linpack with some CUDA acceleration, and it likes lots of CPU cores. It actually runs about as well on a cluster (even over ethernet), as it does on a single cpu-multi gpu box.

In the longer run I might make it available, but my major reason for posting this is to try and find out whether anybody outside of NVIDIA has actually worked on this, and if they have, what did they do. Double precision in CUDA is going to become a hot topic with Fermi on the way, and I see having a high performance double precision benchmark like HPL as being useful, both for those with new hardware and those with old hardware wanting to make a case for new hardware (put me in the latter category).

can you post the hpl2 setup file here? O:)

I totally agree.

Sorry, I am not sure what you are asking for. If you are asking for the HPL.dat file, you can see all the parameters in the output of the run above. If you are asking for the top level Makefile of the build, I can post it if you want, but prepare to be severely disappointed. That won’t make HPL magically use your GPU. There is rather a lot of new code involved in this. Without the other modifications, you will get a host cpu only version linked with CUBLAS.

EDIT: There was some information about multi-gpu here, but there were some subtle problems in the code which meant the computational efficiency was poor. Deleted.

There was a bug. I fixed it. I gained about 20GFlops for the single card case by doing so. Behold:

============================================================

====================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

============================================================

====================

An explanation of the input/output parameters follows:

T/V	: Wall time / encoded variant.

N	  : The order of the coefficient matrix A.

NB	 : The partitioning blocking factor.

P	  : The number of process rows.

Q	  : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N	  :   24000 

NB	 :	1024 

PMAP   : Column-major process mapping

P	  :	   1 

Q	  :	   1 

PFACT  :   Crout 

NBMIN  :	   8 

NDIV   :	   2 

RFACT  :   Crout 

BCAST  :  1ringM 

DEPTH  :	   0 

SWAP   : Mix (threshold = 192)

L1	 : transposed form

U	  : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

============================================================

====================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WC01C2C8	   24000  1024	 1	 1			 112.86			  8.167e+01

============================================================

====================

Finished	  1 tests with the following results:

			  1 tests completed without checking,

			  0 tests skipped because of illegal input values.

--------------------------------------------------------------------------------

End of Tests.

============================================================

====================

I am pretty sure that puts this code at about the same numbers/performance as M.Fatica was posting a while ago for a single GPU case, except here I am using a retail GTX275 with only 896Mb of ram.

As you only run on one GPU, do you keep the whole Matrix on the GPU or do you only export BLAS-Operations to the GPU?

The approach is somewhere between the two extremes. The only parts of factorization that are offloaded to the GPU are the post panel factorization updates. The CPU factorized panel is uploaded to the GPU and then DTRSM is run on the GPU to solve the matrix-matrix equation, with the resulting solution matrix kept in GPU memory. The DGEMM update is then run in both GPU memory and host memory in parallel, using an execution time model to work out the optimal split between host and GPU. Because the matrix size can be larger than the available GPU memory, the code is also capable of executing multiple parallel DGEMM calls on both the host and device to compute the whole matrix-matrix product. Otherwise, It is still the standard MPI based portable HPL code, so each host process will try and acquire its own GPU context, if one is available. Because of this, the code can run on more than one GPU. I have run it on a cluster as well as individual machines with MPI.

Thanks for the work and results avidday!

It would be very useful (and probably good for sales too) if there was a generally available HPL-type benchmark which utilized CUDA-capable devices in the system.

As much as I dislike synthetic benchmarks, HPL is a recognized measure of cluster performance. Today, if a customer requests a machine with a target HPL performance, there is no straightforward way to use GPUs to meet that mark. With the arrival of Fermi, this gap between realizable and documented performance for GPU systems will only grow.

For GPU-based systems to be accepted as a regular fixture in the HPC server room, their performance must be quantifiable through existing, generally accepted measures.

Actually, since posting this, I finally caved in and put together a multigpu version of the code. This version can distribute the post-panel factorisation update over as many GPUs as are available. Unfortunately I seem to be chasing a gcc 4.3 optimization bug which is preventing me from benchmarking the fast version of the host side code, which is a bit frustrating, but I am hoping to have some multigpu results to share quite soon.

Is the result vaild? because I notice the “(eps)” part is missing here.

Oh it is valid. I just got sick of waiting two odd minutes at a time for the verification code to finish. On these large matrices, the HPL code spends almost as more time generating the random matrix to factorize and verifying the results of the factorization than it does actual doing the run. Here is one, just for you:

avid@cuda:~/code/hpl-2.0$ date; bin/CUDA/xhpl 2>/dev/null;date

Wed Jan 27 19:07:08 EET 2010

============================================================

====================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

============================================================

====================

An explanation of the input/output parameters follows:

T/V	: Wall time / encoded variant.

N	  : The order of the coefficient matrix A.

NB	 : The partitioning blocking factor.

P	  : The number of process rows.

Q	  : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N	  :   24000 

NB	 :	1024 

PMAP   : Column-major process mapping

P	  :	   1 

Q	  :	   1 

PFACT  :   Crout 

NBMIN  :	   8 

NDIV   :	   2 

RFACT  :   Crout 

BCAST  :  1ringM 

DEPTH  :	   0 

SWAP   : Mix (threshold = 192)

L1	 : transposed form

U	  : transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

	  ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be			   1.110223e-16

- Computational tests pass if scaled residuals are less than				16.0

============================================================

====================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WC01C2C8	   24000  1024	 1	 1			 119.59			  7.707e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0018796 ...... PASSED

============================================================

====================

Finished	  1 tests with the following results:

			  1 tests completed and passed residual checks,

			  0 tests completed and failed residual checks,

			  0 tests skipped because of illegal input values.

--------------------------------------------------------------------------------

End of Tests.

============================================================

====================

Wed Jan 27 19:10:32 EET 2010

Dragging this up again to vent… gcc 4.3 miscompiles part of my DGEMM code and costs me 30GFlops in the process! Look at the following:

Compiling an instrumented test harness with gcc optimization on (84 GFLOPs):

avid@cuda:~/code/pthreads_test$ make clean; CFLAGS="-Wall -O3 -m64 -fPIC -mtune=barcelona" make

rm -f pthread_test gputhread_support.o pthreads_test.o dcmtimer.o

cc -Wall -O3 -m64 -fPIC -mtune=barcelona	-c -I./ -I/opt/cuda-2.3/include gputhread_support.c -o gputhread_support.o

gputhread_support.c: In function ‘gpuMemPitch’:

gputhread_support.c:111: warning: suggest parentheses around arithmetic in operand of |

gputhread_support.c: In function ‘gpuThreadStartup’:

gputhread_support.c:540: warning: implicit declaration of function ‘pthread_mutexattr_settype’

cc -Wall -O3 -m64 -fPIC -mtune=barcelona	-c -I./ -I/opt/cuda-2.3/include pthreads_test.c -o pthreads_test.o

cc -Wall -O3 -m64 -fPIC -mtune=barcelona	-c -I./ -I/opt/cuda-2.3/include dcmtimer.c -o dcmtimer.o

cc -Wall -O3 -m64 -fPIC -mtune=barcelona	 -o pthread_test gputhread_support.o pthreads_test.o dcmtimer.o -L/opt/goto/GotoBLAS2/lib -lgoto2 -lgfortran -lpthread -L/opt/cuda-2.3/lib64 -lcublas -lcudart -lpthread -lcuda 

avid@cuda:~/code/pthreads_test$ ./pthread_test

1265024904.383380 {80a9e700}  Generating random matrices on host  in pthreads_test.c, line 583

1265024916.128986 {80a9e700}  Random matrices generation done  in pthreads_test.c, line 609

1265024916.132532 {80a9e700}  Signalling Identify  in pthreads_test.c, line 621

1265024916.132993 {7b528950} 0 GeForce GTX 275, 1460 MHz, 895 Mb, Compute Capability 1.3, Compute Prohibited mode in gputhread_support.c, line 248

1265024916.133016 {7ad27950} 1 GeForce GTX 275, 1460 MHz, 895 Mb, Compute Capability 1.3, Compute Exclusive mode in gputhread_support.c, line 248

1265024916.434011 {7ad27950} 1 GeForce GTX 275, Allocated 848 Mb in gputhread_support.c, line 311

1265024917.133569 {80a9e700}  Creating GEMM schdule  in pthreads_test.c, line 636

1265024917.133734 {80a9e700}  DGEMM() startup m=8192, n=8192, k=8192  in pthreads_test.c, line 673

1265024930.139468 {80a9e700}  DGEMM() complete GFLOPS = 8.454092e+01  in pthreads_test.c, line 686

1265024930.139483 {80a9e700}  Starting Host BLAS3 calculation  in pthreads_test.c, line 689

1265024954.925274 {80a9e700}  Host BLAS3 calculation complete GFLOPS = 4.436060e+01  in pthreads_test.c, line 698

1265024954.925297 {80a9e700}  Computing maximum difference between solutions in pthreads_test.c, line 700

1265024959.878137 {80a9e700}  Maximum difference = 1.350031197944190e-12  in pthreads_test.c, line 704

1265024959.878162 {80a9e700}  Signalling Exit  in pthreads_test.c, line 711

1265024959.964253 {80a9e700}  Executing shutdown  in pthreads_test.c, line 714

And the same harness rebuilt with gcc optimization off (114 GFLOPs):

avid@cuda:~/code/pthreads_test$ make clean; CFLAGS="-Wall -O0 -m64 -fPIC -mtune=barcelona" make

rm -f pthread_test gputhread_support.o pthreads_test.o dcmtimer.o

cc -Wall -O0 -m64 -fPIC -mtune=barcelona	-c -I./ -I/opt/cuda-2.3/include gputhread_support.c -o gputhread_support.o

gputhread_support.c: In function ‘gpuMemPitch’:

gputhread_support.c:111: warning: suggest parentheses around arithmetic in operand of |

gputhread_support.c: In function ‘gpuThreadStartup’:

gputhread_support.c:540: warning: implicit declaration of function ‘pthread_mutexattr_settype’

cc -Wall -O0 -m64 -fPIC -mtune=barcelona	-c -I./ -I/opt/cuda-2.3/include pthreads_test.c -o pthreads_test.o

cc -Wall -O0 -m64 -fPIC -mtune=barcelona	-c -I./ -I/opt/cuda-2.3/include dcmtimer.c -o dcmtimer.o

cc -Wall -O0 -m64 -fPIC -mtune=barcelona	 -o pthread_test gputhread_support.o pthreads_test.o dcmtimer.o -L/opt/goto/GotoBLAS2/lib -lgoto2 -lgfortran -lpthread -L/opt/cuda-2.3/lib64 -lcublas -lcudart -lpthread -lcuda 

avid@cuda:~/code/pthreads_test$ ./pthread_test 

1265025553.926115 {28781700}  Generating random matrices on host  in pthreads_test.c, line 583

1265025564.827765 {28781700}  Random matrices generation done  in pthreads_test.c, line 609

1265025564.831267 {28781700}  Signalling Identify  in pthreads_test.c, line 621

1265025564.831741 {2320b950} 0 GeForce GTX 275, 1460 MHz, 895 Mb, Compute Capability 1.3, Compute Prohibited mode in gputhread_support.c, line 248

1265025564.831765 {22a0a950} 1 GeForce GTX 275, 1460 MHz, 895 Mb, Compute Capability 1.3, Compute Exclusive mode in gputhread_support.c, line 248

1265025565.094093 {22a0a950} 1 GeForce GTX 275, Allocated 848 Mb in gputhread_support.c, line 311

1265025565.832355 {28781700}  Creating GEMM schdule  in pthreads_test.c, line 636

1265025565.832537 {28781700}  DGEMM() startup m=8192, n=8192, k=8192  in pthreads_test.c, line 673

1265025575.474050 {28781700}  DGEMM() complete GFLOPS = 1.140399e+02  in pthreads_test.c, line 686

1265025575.474062 {28781700}  Starting Host BLAS3 calculation  in pthreads_test.c, line 689

1265025600.242516 {28781700}  Host BLAS3 calculation complete GFLOPS = 4.439164e+01  in pthreads_test.c, line 698

1265025600.242540 {28781700}  Computing maximum difference between solutions in pthreads_test.c, line 700

1265025606.150005 {28781700}  Maximum difference = 1.364242052659392e-12  in pthreads_test.c, line 704

1265025606.150030 {28781700}  Signalling Exit  in pthreads_test.c, line 711

1265025606.237560 {28781700}  Executing shutdown  in pthreads_test.c, line 714

Hi,
Any chance you can post the entire code you’re using? I dont remember if you already answered it, but will
this test for example a multi-GPU environment with multi S1070s??

Any chance you can post a jump-start explaination as to what needed to be installed, done and what to do
in order to do those performance tests???

thanks
eyal

I probably will make it available, but not yet. I have some things I want to publish first. There is a lot of code, so I doubt I will post it here. But I will make it available for download somewhere. The odd part is that what is miscompiling is some pretty simple integer calculations which are part of the scheduling algorithm for the gemm kernel.

It should. The code contains a pthreads based multi-gpu framework which can do a number of useful things, including out of core gemm() for smaller memory cards where everything can’t fit into GPU memory, and memory balancing where cards with different memory sizes or available memory can be used optimally. It certainly works for various hardware I have been able to try it on so far (unfortunately that doesn’t incude an S1070 yet).

Nothing exotic. Linux, the host blas and MPI of your choice, CUDA 2.3 or 3.0, and a GT200 or Fermi based GPU.

+1 Really interested to see the code :teehee:

Kazushige Goto had leave TACC, does anyone know where does he join? His GotoBLAS is very fast.

+2 more really interested to see the code