Double Precision LU

Anyone feel like taking a shot in the dark as to the possible performance of a 2000x2000 double precision LU factorization?

Or would I need to be looking into the 8000 range as seen in V. Volkov’s single precision factorization work (http://forums.nvidia.com/index.php?showtopic=89084&hl=volkov)?

Ben

You don’t need to take too much of a guess. Just read this.

It gives some pretty indicative numbers for GPU and CPU+GPU HPL, which is effectively what you are interested in (I am presuming you are talking about full rather than sparse factorization).

Maybe this can give you an idea of the differences between single and double precision:

http://www3.uji.es/~figual/files/Papers/chol_LU_TR.pdf

Has anyone played with converting Volkov’s code to doubles?

Ben

(Edit: Not meaning to be rude and ignore your comments.

avidday: That very well may be what we end up using. Just checking around.
figual: Ah, I see. Good to know the performance picks up pretty quickly around the Ns we’re considering)

Modifying Vasily’s code is probably the best option.

These are the results from HPL.

mpirun -np 1 ./run_linpack 

================================================================================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

================================================================================

An explanation of the input/output parameters follows:

T/V	: Wall time / encoded variant.

N	  : The order of the coefficient matrix A.

NB	 : The partitioning blocking factor.

P	  : The number of process rows.

Q	  : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N	  :	2000 

NB	 :	1280	 1152	  960	  896	  768	  640	  512	  384 

			 256	  128 

PMAP   : Row-major process mapping

P	  :	   1 

Q	  :	   1 

PFACT  :	Left 

NBMIN  :	   2 

NDIV   :	   2 

RFACT  :	Left 

BCAST  :   1ring 

DEPTH  :	   1 

SWAP   : Mix (threshold = 256)

L1	 : no-transposed form

U	  : no-transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

	  ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be			   1.110223e-16

- Computational tests pass if scaled residuals are less than				16.0

Assigning device 0  to process on node  rank 0 

DTRSM split from environment variable 0.520000 

DGEMM split from environment variable 0.655000 

================================================================================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WR10L2L2		2000  1280	 1	 1			   0.44			  1.223e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0077577 ...... PASSED

================================================================================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WR10L2L2		2000  1152	 1	 1			   0.40			  1.345e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0067703 ...... PASSED

================================================================================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WR10L2L2		2000   960	 1	 1			   0.30			  1.755e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0061134 ...... PASSED

================================================================================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WR10L2L2		2000   896	 1	 1			   0.31			  1.746e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0075184 ...... PASSED

================================================================================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WR10L2L2		2000   768	 1	 1			   0.28			  1.889e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0071333 ...... PASSED

================================================================================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WR10L2L2		2000   640	 1	 1			   0.27			  1.942e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0061972 ...... PASSED

================================================================================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WR10L2L2		2000   512	 1	 1			   0.26			  2.080e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0056605 ...... PASSED

================================================================================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WR10L2L2		2000   384	 1	 1			   0.26			  2.074e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0069768 ...... PASSED

================================================================================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WR10L2L2		2000   256	 1	 1			   0.26			  2.049e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0073184 ...... PASSED

================================================================================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WR10L2L2		2000   128	 1	 1			   0.32			  1.680e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0053558 ...... PASSED

================================================================================

Finished	 10 tests with the following results:

			 10 tests completed and passed residual checks,

			  0 tests completed and failed residual checks,

			  0 tests skipped because of illegal input values.

--------------------------------------------------------------------------------

End of Tests.

================================================================================
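For anyone wanting to sanity-check the Gflops column, here is a rough sketch that recomputes the rate from N and the wall time. It assumes the conventional HPL operation count of 2/3 N^3 + 3/2 N^2 for the factorization plus solve; the function names are made up for illustration. The times in the log are rounded to two decimals, so the recomputed rates only agree with the printed ones to within a few percent.

```python
def hpl_flops(n):
    """Floating point operations credited for one dense n x n solve
    (assumed count: 2/3 n^3 + 3/2 n^2)."""
    return (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2


def gflops(n, seconds):
    """Execution rate in Gflops for a solve of order n taking `seconds`."""
    return hpl_flops(n) / seconds / 1e9


# (NB, wall time in seconds) pairs copied from the run above, N = 2000
runs = [(1280, 0.44), (1152, 0.40), (960, 0.30), (896, 0.31), (768, 0.28),
        (640, 0.27), (512, 0.26), (384, 0.26), (256, 0.26), (128, 0.32)]

for nb, t in runs:
    print(f"NB={nb:5d}  time={t:.2f}s  approx {gflops(2000, t):5.1f} Gflops")
```

Read this way, the sweep suggests that for N = 2000 the block sizes from 256 to 512 do best (roughly 20 to 21 Gflops), while the largest block sizes leave too little of the matrix for the trailing update.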

Ah, very nice. Thanks for the info, Massimiliano (pardon my ignorance, but is there an acceptable short way to say your name?)

Ben