Poor results with LAPACK

I need to invert a covariance matrix to solve a standard least squares Y=Ax+B, and to get the error covariance matrix in return. I have to invert a lot of small matrixes with size ranging from 20x20 to 2000x2000. A Cholesky LU decomposition from lapack is used to speed up the problem solving.

Since there is no lapack functions in the CUDA libraries, I used the regular lapack library with CUBLAS to handle basic algebra operations: spotrf/spotri. When I compare the CPU vs. GPU performance, the results are good for large matrixes (improvement by a factor 4 or more), but they become very disappointing when the covariance matrix is smaller than 1000x1000. A large majority of the (many) problems I need to solve are small and the GPU inversion is sometimes 10 or 20 times slower than the CPU equivalent (same code, different BLAS library used).

It seems that the time needed to allocate/desallocate GPU memory and to transfer data from/to the main memory to/from the GPU memory is killing the solver performance. When the covariance matrix is large (N > 1500) data handling requires less time than actual BLAS computations hence the good results observed with the GPU. However when the matrix is small, the BLAS inversion (LU decomposition) is so fast that data allocations and transfer are bogging down the computation.

Thus, my question is threefold:

    Do you have an example of similar least square solving with LU decomposition I could mimic?

    Do you have any plan of releasing a high-level LAPACK-like GPU library (even partial) where the main LU decomposition engine would be optimized (no endless data transfer between main memory and gpu memory)?

    Do you have any tip or suggestion I could use to benefit from the GPU firepower for simple (but numerous) inversion problems?

You can see my results and my code here.

I’m looking forward to reading your reply,

Keep up the good work on CUDA,

All the best,

F. Briol

Hi fbriol,

You seem to have adapted the “thunking” Fortran wrappers from the SDK for C. This means every CUBLAS call is wrapped in an alloc/copy/ copy/free wrapper. That’s eating up all the performance for small matrices.

Instead, you should allocate and free the data once, and otherwise keep the data on the
GPU, only copying back the portions needed for control flow. Admittedly, even that
alone can be a performance problem, but it should be a lot better than your first attempt.

Hope that helps, and let us know your results!


Hi Mark,

Thanks a lot for your help.

Thanks a lot for your help.

I modified my wrapper to share memory between functions (i assume the functions are never call through threads). So for each cublas call, the wrapper do only copy to gpu/copy to cpu. Performances are better but not dramiticaly better (you can see my benchmark results and my new wrapper in my webpage).

I think to have a better performance, i need to do: function -> copy to gpu -> execute all code on gpu -> copy to cpu.

But to do that, i need to call blas functions in GPU, but can’t write this code :

__global__ void

testKernel( float* data, float* out, int n) 


    *out = cublasSasum(n, data, 1);


How can I do something like this ?

All the best,



You need to avoid copies between CPU and GPU whenever possible. The general approach should look like this:

(1) allocate device memory
(2) copy data from host to device
(3) call as many CUBLAS functions as needed to process the data without copying the results back to the GPU in between calls. The GPU device does data processing, and the host does control flow. The control flow may require reading back small pieces of data from the device occasionally, e.g. use of ISAMAX in pivoting.
(4) copy results back from device to host.
(5) free device memory

This means that you also need to allocate intermediate data on the GPU in (1) – these intermediate arrays will be used to store the output of one CUBLAS function which is then used as input to the next CUBLAS function.

Other notes:
In general, matrices [or matrix tiles] should not be too small, in order to utilize
the full parallelism of the device. Example: On G80, SGEMM needs matrices
of at least 128x128 to utilize all available threads; this size will move upwards with
future devices. Other BLAS functions require similar sized matrices to keep the
device well filled. Because of the tiling used in CUBLAS, no dimension of a matrix
should be less than 32 for good performance, and it is best if all dimensions are
multiples of 32, since handling end-cases is expensive.


Did you succeed in improving the performance using CUBLAS with LAPACK? I’m in a similar situation to you: I need to operate with CUBLAS then use LAPACK and finish with CUBLAS again. I presume that I can’t help but move the matrices to host memory for use with LAPACK or can I? I’m also considering using JAMA (http://math.nist.gov/tnt/index.html) instead of LAPACK.
Any help is appreciated. Thanks in advance.

I would like to echo the request for standardized CUDA optimized linear equation solvers. My application is on large arrays (a couple thousand on a side). (I saw another request for versions tuned for many small matrices; that’s not what I’m looking for).

Linear algebra is so ubiquitous that I’m sure you guys won’t stop at BLAS Level 3, but I thought I’d throw in some words of encouragement anyway!


I’ve posted a few threads in the General Computing section about this…basically, the consensus is that a totally CUDA-optimized library would need to be implemented with the LAPACK interface. Just wrapping the CUBLAS functionality with the normal LAPACK doesn’t take good advantage of CUDA’s strengths; one of the researchers working with nVidia on linear algebra code (forum user vvolkov) has created some optimized QR/LU/Cholesky decomposition code (which would be heavily relied on for something like this) that approaches the theoretical maximum performance of the GPU.

If you just need to invert a matrix, there is a forum user that posted his kernel code to do a matrix inversion in the General Computing forum that you could use. Hopefully, nVidia will put some time into making a CUDA optimized, COMPLETE version of BLAS/LAPACK soon. I think it would be an instant ‘killer app’ for anyone that does modeling/simulation/analysis.

EDIT: I forgot to mention…even the most highly tuned GPU code may not beat the CPU for small matrices due to the transfer overhead. When I read vvolkov’s paper about his decomposition code, the break-even point was somewhere around 512x512 matrices for most of them (if I remember correctly). If you’re already getting some speedups in your code by just using the standard LAPACK wrapper, why not put a branch in your code that checks the size of the matrix, then runs the calculation on the CPU or GPU accordingly?

EDIT 2: Here are some other related topics for you guys:

So, does anybody have code for an optimized linear least squares solver that they are willing to share?

I don’t know about a whole solver, but vvolkov posted his optimized code for QR and LU decompositions a while back. You could probably adapt the QR code to do the linear least squares problem without too much trouble.