In a very early foray into double precision about a year ago, I wrote a really nasty little wrapper function around CUBLAS dgemm and built and ran a couple of HPLinpack tests. It worked, but the results were less than spectacular, so I didn’t pursue it further, other than promising myself I would do it “properly” for CUDA when I had some free time. Massimiliano Fatica’s great paper on HPLinpack and CUDA only added further to my interest. I have noticed that from time to time there have been naïve posts here enquiring about running Linpack with CUDA (most recently this one), but it seems nobody actually got all that far.
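To give an idea of what that early wrapper looked like, the sketch below is roughly the approach (reconstructed for illustration rather than the actual code; I am assuming cublasInit() has already been called at startup, and there is no error checking): intercept the dgemm calls HPL makes through the BLAS interface and bounce every operand across the bus on each call.

/* Drop-in replacement for the Fortran BLAS dgemm_ symbol, using the legacy
 * CUBLAS API: stage A, B and C into device memory, run cublasDgemm, and
 * copy C back. Every operand crosses the PCI-e bus on every call, which is
 * why the results were less than spectacular.                              */
#include <cublas.h>

void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb,
            const double *beta, double *C, const int *ldc)
{
    int rowsA = (*transa == 'N' || *transa == 'n') ? *m : *k;
    int colsA = (*transa == 'N' || *transa == 'n') ? *k : *m;
    int rowsB = (*transb == 'N' || *transb == 'n') ? *k : *n;
    int colsB = (*transb == 'N' || *transb == 'n') ? *n : *k;
    double *dA, *dB, *dC;

    cublasAlloc(rowsA * colsA, sizeof(double), (void **)&dA);
    cublasAlloc(rowsB * colsB, sizeof(double), (void **)&dB);
    cublasAlloc((*m) * (*n),   sizeof(double), (void **)&dC);

    /* Push operands to the device, multiply, pull the result back. */
    cublasSetMatrix(rowsA, colsA, sizeof(double), A, *lda, dA, rowsA);
    cublasSetMatrix(rowsB, colsB, sizeof(double), B, *ldb, dB, rowsB);
    if (*beta != 0.0)   /* C is only an input when beta is non-zero */
        cublasSetMatrix(*m, *n, sizeof(double), C, *ldc, dC, *m);

    cublasDgemm(*transa, *transb, *m, *n, *k, *alpha,
                dA, rowsA, dB, rowsB, *beta, dC, *m);

    cublasGetMatrix(*m, *n, sizeof(double), dC, *m, C, *ldc);

    cublasFree(dA);
    cublasFree(dB);
    cublasFree(dC);
}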
During the poor weather we “enjoyed” over Christmas, I decided to do a bit of instrumentation and profiling of v2.0 of the netlib portable implementation to see where the low-hanging fruit is in terms of offloading parts of the calculation onto the GPU, trying to amortise PCI-e bus overheads wherever practicable. At the moment I have only offloaded the post-panel-factorization update code, so that the GPU executes the dtrsm(), and then host and device dgemm() calls are overlapped, each working on some fraction of the total columns in the update region (a rough sketch follows below). I think I have gone about as far as I can go in that direction, and I am beginning to get vaguely respectable-looking numbers, but I wonder whether anyone else has actually tried this, and what their approach looks like. Most importantly, what sort of computational efficiency numbers have been achieved?
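To make the splitting scheme concrete, here is a stripped-down sketch of the idea (illustrative only: the helper names, the fixed split fraction and the dtrsm flags are simplifications of what the real code does, I assume the panel and the update region are already resident on the device, and I gloss over keeping host and device copies coherent between iterations):

/* One post-panel-factorization update step, split between host and device.
 * Device copies: dL1 (nb x nb, ld nb), dL2 (m x nb, ld m),
 *                dU  (nb x n,  ld nb), dA  (m x n,  ld m).
 * Host copies:   L2  (m x nb,  ld m),  U   (nb x n, ld nb), A (m x n, ld lda). */
#include <stddef.h>
#include <cublas.h>

extern void dgemm_(const char *, const char *, const int *, const int *,
                   const int *, const double *, const double *, const int *,
                   const double *, const int *, const double *, double *,
                   const int *);

/* Hypothetical split heuristic: fraction of update columns given to the GPU. */
static int gpu_cols(int n, double split) { return (int)(n * split); }

void update_split(int m, int n, int nb,
                  const double *dL1, const double *dL2, double *dU, double *dA,
                  const double *L2, double *U, double *A, int lda, double split)
{
    const double one = 1.0, mone = -1.0;
    int n_gpu = gpu_cols(n, split);
    int n_cpu = n - n_gpu;

    /* 1. Triangular solve of the whole U block on the GPU
          (flags shown for a plain lower-triangular, non-transposed L1).   */
    cublasDtrsm('L', 'L', 'N', 'U', nb, n, 1.0, dL1, nb, dU, nb);

    /* 2. Pull the solved U back; this blocks until the dtrsm has finished,
          and the host needs its share of U anyway.                        */
    cublasGetMatrix(nb, n, sizeof(double), dU, nb, U, nb);

    /* 3. Launch the device dgemm on the first n_gpu columns. The launch is
          asynchronous, so the host thread falls straight through...       */
    cublasDgemm('N', 'N', m, n_gpu, nb, -1.0, dL2, m, dU, nb, 1.0, dA, m);

    /* 4. ...and does its own dgemm (GotoBLAS2 here) on the remaining n_cpu
          columns while the GPU is busy.                                   */
    dgemm_("N", "N", &m, &n_cpu, &nb, &mone, L2, &m,
           U + (size_t)n_gpu * nb, &nb, &one, A + (size_t)n_gpu * lda, &lda);

    /* 5. Retrieve the GPU's share of the updated trailing matrix; this
          copy waits for the device dgemm to complete.                     */
    cublasGetMatrix(m, n_gpu, sizeof(double), dA, m, A, lda);
}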
At the moment, about the best I can do on a 3.0GHz AMD Phenom II with a stock 896MB GTX 275, using GotoBLAS2 and CUBLAS 2.3, looks something like this:
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 16000
NB : 1024
PMAP : Column-major process mapping
P : 1
Q : 1
PFACT : Crout
NBMIN : 8
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 0
SWAP : Mix (threshold = 192)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC01C2C8 16000 1024 1 1 43.88 6.223e+01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0030391 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
That is only a shade over 50% computational efficiency, which isn’t all that fabulous. GPU memory size looks like the real bottleneck to improving things, especially for the dgemm() call: because of the shape of the panel matrices, the total FLOPs in the update are small relative to the memory footprint of the product matrix (some rough numbers below).
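To put rough numbers on that, treat the very first trailing update at N=16000, NB=1024 as one big dgemm (a simplification, but it makes the point):

/* Back-of-the-envelope arithmetic for the first trailing update at
 * N = 16000, NB = 1024, treated as a single m x n x k dgemm.            */
#include <stdio.h>

int main(void)
{
    double N = 16000.0, NB = 1024.0;
    double m = N - NB, n = N - NB, k = NB;       /* update dgemm shape          */

    double flops   = 2.0 * m * n * k;            /* ~4.6e11 FLOPs               */
    double c_bytes = m * n * 8.0;                /* product matrix C: ~1.7 GiB  */

    printf("dgemm FLOPs       : %.3g\n", flops);
    printf("C footprint (GiB) : %.2f\n", c_bytes / 1073741824.0);
    printf("FLOPs per C byte  : %.0f\n", flops / c_bytes);   /* = NB/4 = 256 */
    return 0;
}

The product matrix alone is getting on for twice the 896MB on the card, and every byte of it only buys NB/4 = 256 FLOPs, so the update has to be streamed through the device in slabs and the transfer cost is hard to hide.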
Anyone else?