LAPACK with cublas - how and is it worth it?

I’ve seen some posts here about people using LAPACK with cublas, how easy is it to do this? I imagine you can’t just link LAPACK against cublas because the function names are all different, could you just make some stubs for BLAS that call cublas and link LAPACK against that?

If that would work is it worth doing, or does having LAPACK not on the gpu counteract the speed up from using cublas? I expect this would mean you end up doing a lot of transfers between the cpu and the gpu every time lapack calls blas which might slow things down too much?

I presume nvidia are working on doing this anyway, so I don’t want to spend too much time on it, unless they definitely aren’t doing it.

It isn’t all that complex to write C wrappers (or use Flagon from Fortran) to replace selected host BLAS calls with CUBLAS calls in a custom BLAS which can be linked with LAPACK. I have also done the same with SuperLU with some success. I found putting some very simple size heuristics inside the wrapper to select whether to use host or device BLAS works reasonably too. There is a paper by Massimo Fatica from NVIDIA which describes a “hybrid” strategy using host and GPU BLAS simultaneously to accelerate LINPACK which is probably worth reading as well.

It’s not a very good strategy to run LAPACK on top of CUBLAS, because while the idea behind LAPACK is that it can fall back to using the lower-level BLAS functions if necessary for a certain matrix, having to transfer data back and forth between the host and GPU is less than ideal (which is what will happen if you use it with CUBLAS).

Depending on what algorithms you need to run on the GPU, someone may have already released a custom version; for example, vvolkov released his matrix factorization code (search this forum and you’ll find it), and it has near-ideal performance. For some other specialty transforms, I’m afraid the best way is just to write your own GPU version of it, so that you can maximize the amount of time that the algorithm spends working with the data on the GPU.

i thought LAPACK was supposed to rely pretty heavily on BLAS, although i expect it depends pretty heavily on what you are doing with it. for now i just need a matrix eigenvalue solver but unfortunately the iterative methods (i think that is what the one in the sdk is) don’t work well for the kind of matrices i am interested in, whereas the lapack ones do.

having looked at that linpack presentation i think i will just try it out and see how it compares to doing it all on the CPU. using a heuristic to decide whether to send things to cublas or a cpu version of blas is a good idea.


That is correct. LAPACK does work on top of BLAS. The problem is that it is designed to run on a CPU, so the various checks it does for iterative methods don’t work well when it’s used with CUBLAS. To avoid repeated overhead in each iteration (and slowing down your computing!), it’s better to write a method specifically for the GPU.

Has anyone done matrix inversion in CUDA yet? I don’t want to invert a big matrix, I want to invert, for example, 1 million matrices that are 10 x 10.

People generally factorize matrices to find solutions to linear equations. This is what I understand…

I am just curious… Why would you want to invert a matrix? Anyway, I would be interested to see how we can do this.

btw, if it is 10x10 then you can arrive at a straight-forward way for inverting a 10x10 matrix… instead of generalizing it.

I want to do CCA,, on for example 100 000 timeseries, the solution requires two matrix inversions per timeseries. The size is not always 10 x 10 it was just an example.


Here is my 2 cents for finding determinants…

Perform LU factorization of the matrix and multiply all diagonal elements of U matrix and thats your determinant. The idea occured to me and I cross checked it on Wikipedia as well…

Just in-case if u r looking to do it the determinant way…

I’m actually looking for eigenvalues and eigenvectors, here is the Matlab code for CCA. I’m actually only looking for the largest eigenvalue and corresponding eigenvector.

[dimx,N] = size(x);

[dimy,N] = size(y);

mx = sum(x,2)/N;

x = x - mx(:,ones(1,N));

my = sum(y,2)/N;

y = y - my(:,ones(1,N));

z = [x;y];

C = 1/N*z*z';

Cxx = C(1:dimx,1:dimx);

Cyy = C(dimx+1:end,dimx+1:end);

Cxy = C(1:dimx,dimx+1:end);

Cyx = Cxy';

invCyy = inv(Cyy);

[wx,rho] = sorteig(inv(Cxx)*Cxy*invCyy*Cyx);

rho = sqrt(rho(1));

wx = wx(:,1);

wy = invCyy*Cyx*wx;

wy = wy/norm(wy);

CULA does eigenvectors and eigenvalues pretty good.

2 disadvantages though:

-it’s working with 1 matrix on the entire GPU at the time (so it won’t give any good results for 1.000.000 matrices 10x10)

-it’s $400 for the premium package

However, there is a free version that does LU and QR factorization.

Ok thanks, but I will probably implement a direct solution for inversion and eigenvalue calculation of 3 x 3 matrices in each thread.

I’m surprised no one mentionned MAGMA so far. It’s meant to be GPU-enabled LAPACK library. Seriously, it’s pretty naive to believe that you will ever get performance for free by linking LAPACK with CUBLAS transparently, it would be nice, but this is way ahead of the state-of-the-art research :)

Do they support functions such as a matrix inverse in each thread?

Parallel batch processing is on the road map for an eventual CULA release.