LAPACK with cublas - how and is it worth it?

I’ve seen some posts here about people using LAPACK with CUBLAS. How easy is it to do? I imagine you can’t just link LAPACK against CUBLAS because the function names are all different, but could you write some BLAS stubs that call CUBLAS and link LAPACK against those?

If that would work, is it worth doing, or does keeping LAPACK itself on the CPU cancel out the speed-up from CUBLAS? I expect you would end up doing a lot of transfers between the CPU and the GPU every time LAPACK calls BLAS, which might slow things down too much.

I presume NVIDIA is working on this anyway, so I don’t want to spend too much time on it unless they definitely aren’t.

It isn’t all that complex to write C wrappers (or use Flagon from Fortran) that replace selected host BLAS calls with CUBLAS calls in a custom BLAS library, which can then be linked with LAPACK. I have also done the same with SuperLU with some success. I found that putting some very simple size heuristics inside the wrapper to select whether to use host or device BLAS works reasonably well too. There is a paper by Massimiliano Fatica from NVIDIA which describes a “hybrid” strategy using host and GPU BLAS simultaneously to accelerate LINPACK; that is probably worth reading as well.
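To make that concrete, here is a minimal sketch of what such a wrapper might look like for DGEMM. It assumes the real CPU routine has been renamed to a hypothetical dgemm_host_ symbol when building the custom BLAS, uses the CUBLAS v2 API, picks an arbitrary size threshold, and leaves out all error handling, so treat it as an illustration rather than a drop-in implementation:

/* Fortran-callable dgemm_ stub: small problems stay on the host BLAS,
 * large ones are shipped to CUBLAS. Not thread-safe; no error checks. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Hypothetical symbol for the real CPU dgemm (e.g. the reference BLAS
 * routine renamed while building the custom library). */
void dgemm_host_(const char*, const char*, const int*, const int*, const int*,
                 const double*, const double*, const int*, const double*,
                 const int*, const double*, double*, const int*);

void dgemm_(const char *ta, const char *tb, const int *m, const int *n,
            const int *k, const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb, const double *beta,
            double *C, const int *ldc)
{
    /* Crude size heuristic: host<->device copies swamp the GPU gain for
     * small matrices, so only ship big GEMMs to the device. The threshold
     * is a guess and needs tuning for each machine. */
    if ((long long)(*m) * (*n) * (*k) < 256LL * 256 * 256) {
        dgemm_host_(ta, tb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        return;
    }

    static cublasHandle_t handle = NULL;
    if (handle == NULL) cublasCreate(&handle);

    /* Number of stored columns of A and B depends on the transpose flags. */
    int ca = (*ta == 'N' || *ta == 'n') ? *k : *m;
    int cb = (*tb == 'N' || *tb == 'n') ? *n : *k;

    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(double) * (size_t)(*lda) * ca);
    cudaMalloc((void**)&dB, sizeof(double) * (size_t)(*ldb) * cb);
    cudaMalloc((void**)&dC, sizeof(double) * (size_t)(*ldc) * (*n));

    cublasSetMatrix(*lda, ca, sizeof(double), A, *lda, dA, *lda);
    cublasSetMatrix(*ldb, cb, sizeof(double), B, *ldb, dB, *ldb);
    cublasSetMatrix(*ldc, *n, sizeof(double), C, *ldc, dC, *ldc);

    cublasOperation_t opA = (*ta == 'N' || *ta == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
    cublasOperation_t opB = (*tb == 'N' || *tb == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
    cublasDgemm(handle, opA, opB, *m, *n, *k,
                alpha, dA, *lda, dB, *ldb, beta, dC, *ldc);

    cublasGetMatrix(*ldc, *n, sizeof(double), dC, *ldc, C, *ldc);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

LAPACK then links against the custom BLAS unchanged; only the calls whose sizes cross the threshold ever touch the GPU.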

It’s not a very good strategy to run LAPACK on top of CUBLAS. LAPACK is built as a layer over the lower-level BLAS routines, and if those BLAS calls are routed to CUBLAS, the data has to be transferred back and forth between the host and the GPU on every call, which is far from ideal.

Depending on what algorithms you need to run on the GPU, someone may have already released a custom version; for example, vvolkov released his matrix factorization code (search this forum and you’ll find it), and it has near-ideal performance. For some other specialty transforms, I’m afraid the best way is just to write your own GPU version of it, so that you can maximize the amount of time that the algorithm spends working with the data on the GPU.

I thought LAPACK was supposed to rely pretty heavily on BLAS, although I expect it depends on what you are doing with it. For now I just need a matrix eigenvalue solver, but unfortunately the iterative methods (I think that is what the one in the SDK is) don’t work well for the kind of matrices I am interested in, whereas the LAPACK ones do.

Having looked at that LINPACK presentation, I think I will just try it out and see how it compares to doing it all on the CPU. Using a heuristic to decide whether to send things to CUBLAS or a CPU version of BLAS is a good idea.

thanks!

That is correct, LAPACK does work on top of BLAS. The problem is that it is designed to run on a CPU, so the checks its iterative methods perform between BLAS calls don’t work well when it’s used with CUBLAS. To avoid paying that overhead again on every iteration (and slowing down your computation!), it’s better to write a method specifically for the GPU.

Has anyone done matrix inversion in CUDA yet? I don’t want to invert a big matrix, I want to invert, for example, 1 million matrices that are 10 x 10.

People generally factorize matrices to find solutions to linear equations. This is what I understand…

I am just curious… Why would you want to invert a matrix? Anyway, I would be interested to see how we can do this.

By the way, if it is 10x10 then you can write a straightforward specialized routine for inverting a 10x10 matrix instead of generalizing it.

I want to do CCA (http://en.wikipedia.org/wiki/Canonical_correlation) on, for example, 100,000 time series; the solution requires two matrix inversions per time series. The size is not always 10 x 10, that was just an example.

Okay,

Here are my 2 cents on finding determinants…

Perform an LU factorization of the matrix and multiply all the diagonal elements of the U matrix; up to the sign contributed by any row swaps during pivoting, that is your determinant. The idea occurred to me and I cross-checked it on Wikipedia as well…

Just in case you are looking to do it the determinant way…
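If it helps, here is a small host-side sketch of that determinant-via-LU idea using LAPACK’s dgetrf; the function name is made up, the matrix is assumed column-major, and the sign flips come from the row swaps recorded in ipiv:

#include <stdlib.h>

/* LAPACK LU factorization with partial pivoting (Fortran symbol). */
void dgetrf_(const int *m, const int *n, double *a, const int *lda,
             int *ipiv, int *info);

/* Returns det(A); 'a' is n x n, column-major, and is overwritten by its LU factors. */
double lu_determinant(double *a, int n)
{
    int *ipiv = (int*)malloc(n * sizeof(int));
    int info;
    double det = 1.0;

    dgetrf_(&n, &n, a, &n, ipiv, &info);
    if (info > 0) {
        det = 0.0;                            /* exactly singular */
    } else {
        for (int i = 0; i < n; ++i) {
            det *= a[i * n + i];              /* product of diag(U) */
            if (ipiv[i] != i + 1) det = -det; /* each row swap flips the sign */
        }
    }

    free(ipiv);
    return det;
}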

I’m actually looking for eigenvalues and eigenvectors; here is the Matlab code for CCA. In fact, I only need the largest eigenvalue and the corresponding eigenvector.

[dimx,N] = size(x);
[dimy,N] = size(y);
mx = sum(x,2)/N;                 % mean of each variable in x
x = x - mx(:,ones(1,N));         % centre x
my = sum(y,2)/N;                 % mean of each variable in y
y = y - my(:,ones(1,N));         % centre y
z = [x;y];
C = 1/N*z*z';                    % joint covariance matrix
Cxx = C(1:dimx,1:dimx);
Cyy = C(dimx+1:end,dimx+1:end);
Cxy = C(1:dimx,dimx+1:end);
Cyx = Cxy';
invCyy = inv(Cyy);
% sorteig is a helper (not shown) that presumably returns the
% eigenvectors/eigenvalues sorted so the largest comes first
[wx,rho] = sorteig(inv(Cxx)*Cxy*invCyy*Cyx);
rho = sqrt(rho(1));              % largest canonical correlation
wx = wx(:,1);                    % corresponding x-side weight vector
wy = invCyy*Cyx*wx;              % y-side weight vector
wy = wy/norm(wy);

CULA does eigenvectors and eigenvalues pretty well.

2 disadvantages though:

-it works on one matrix across the whole GPU at a time (so it won’t give any good results for 1,000,000 matrices of size 10x10)

-it’s $400 for the premium package

However, there is a free version that does LU and QR factorization.

Ok thanks, but I will probably implement a direct solution for inversion and eigenvalue calculation of 3 x 3 matrices in each thread.
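For the 3 x 3 case a direct per-thread inverse is straightforward with the explicit adjugate/determinant formula. Here is a rough sketch; the kernel name, the packed row-major layout (9 doubles per matrix) and the missing singularity check are all assumptions:

#include <cuda_runtime.h>

/* One thread inverts one 3x3 matrix. A and Ainv hold 'count' matrices,
 * each stored as 9 consecutive doubles in row-major order. */
__global__ void invert3x3_batch(const double *A, double *Ainv, int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= count) return;

    const double *m = A    + 9 * idx;   /* this thread's matrix   */
    double       *r = Ainv + 9 * idx;   /* where its inverse goes */

    /* Adjugate (transpose of the cofactor matrix). */
    double a00 = m[4]*m[8] - m[5]*m[7];
    double a01 = m[2]*m[7] - m[1]*m[8];
    double a02 = m[1]*m[5] - m[2]*m[4];
    double a10 = m[5]*m[6] - m[3]*m[8];
    double a11 = m[0]*m[8] - m[2]*m[6];
    double a12 = m[2]*m[3] - m[0]*m[5];
    double a20 = m[3]*m[7] - m[4]*m[6];
    double a21 = m[1]*m[6] - m[0]*m[7];
    double a22 = m[0]*m[4] - m[1]*m[3];

    /* Determinant by cofactor expansion along the first row.
     * No guard against det == 0 in this sketch. */
    double inv = 1.0 / (m[0]*a00 + m[1]*a10 + m[2]*a20);

    r[0] = a00*inv; r[1] = a01*inv; r[2] = a02*inv;
    r[3] = a10*inv; r[4] = a11*inv; r[5] = a12*inv;
    r[6] = a20*inv; r[7] = a21*inv; r[8] = a22*inv;
}

Launched with something like invert3x3_batch<<<(count + 255)/256, 256>>>(d_A, d_Ainv, count), each thread reads and writes only its own nine values, so the per-matrix eigenvalue calculation can be handled in-thread the same way.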

I’m surprised no one has mentioned MAGMA so far. It’s meant to be a GPU-enabled LAPACK library. Seriously, it’s pretty naive to believe that you will ever get performance for free by linking LAPACK with CUBLAS transparently; it would be nice, but that is way ahead of the state of the art in research :)

Do they support functions such as a matrix inverse in each thread?

Parallel batch processing is on the road map for an eventual CULA release.