does anybody know when and if NVIDIA will be releasing a LAPACK routine? even a rough date will be greatly appreciated :D

I agree that having a CUDA LAPACK library would be nice, and I think it’s a common request on the wish list post and elsewhere. While I don’t have an answer to your question, if you only need a handful of the LAPACK routines, you could always just implement them yourself (rather than waiting for NVIDIA) as they are in LAPACK by replacing the BLAS calls with CUBLAS calls. I ended up doing this for SGETRF and SGETRI in order to have a general matrix inversion routine that uses the GPU.

  • Byron

You could do that (replace BLAS calls with CUBLAS calls)…but to be honest, that way is terribly inefficient because it doesn’t take kernel launch overhead, differences in reading from device memory, or host <–> device transfers into account. There are a few LAPACK routines that have been written specifically for CUDA and posted to the forum before that will get you much better performance. However, I still haven’t seen a good matrix inversion routine written/posted yet…so perhaps I will give it a shot at some point.

The ones I’ve seen specifically are QR/LU decompositions and some matrix multiplication stuff. Search the programming forum for posts by vvolkov, he does a lot of work with this kind of stuff, and maybe he’s posted something that will be useful to you.

In any case, I keep hearing that the new release of CUBLAS (with a bunch of new functions implemented) is going to be available “soon”…but it’s been a pretty long time and we’re still waiting. I hope that someone at nVidia makes this and CUFFT a priority soon, since having a ready-to-use library with these functions would make CUDA attractive to many more developers.

You’re right in that this “naive” approach doesn’t take advantage of potential hardware specific optimizations that could be achieved through clever algorithms. That said, I did minimize that amount of host<->device data transfer and actually achieved reasonable performance for the domain I’m working on. The largest inversion I’ve looked at is 1000x1000 which isn’t that big, relatively speaking. I need to do some analysis to see if using straight LAPACK wouldn’t just be faster. I’ll clean up my code and post it later for people to take a look at as well.

  • Byron

Hi !

Waiting for a good matrix inversion cuda kernel, would you mind, please, sharing your work ? As far as I am concerned, I am limited by host <-> device data transfer and maybe your code could help me. Thank you very much in advance for your help.

Best Regards,


hmm ok. well is there any place people can (or are currently) sharing files on CULAPACK? cos if so I would like to submit some of the function that I have manually done.

basically does NVIDIA do something like GNU where u can just submit the code by various people who are interested in the project? so we can manually make our own LAPACK :P


I’ve seen a lot of people just attach the code/makefiles/tools they want to share to their forum posts…seems like it would work alright for this too…especially if it was limited to 1 thread.

Ok, here’s my LAPACK-inspired, CUBLAS-based approach to invert a square matrix.

  • Byron



** Simple matrix inversion routine using CUDA


** This reimplements the LAPACK sgetri and associated required routines

** by replacing the BLAS calls with CUBLAS calls.


** Byron Galbraith

** Department of Mathematics, Statistics, and Computer Science

** Marquette University

** 2009-04-30


#include <cublas.h>

// Prototypes

int* cudaSgetrf(unsigned int n, float *dA);

void cudaSgetri(unsigned int n, float *dA, int *pivots);

void cudaStrtri(unsigned int n, float *dA);


** cudaInvertMatrix

** Inverts a square matrix in place

** n - matrix dimension

** A - pointer to array of floats representing the matrix in column-major order


extern “C” void

cudaInvertMatrix(unsigned int n, float *A)


int *pivots;

float *dA;


cublasAlloc(n * n, sizeof(float), (void**)&dA);

cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);

// Perform LU factorization

pivots = cudaSgetrf(n, dA);

// Perform inversion on factorized matrix

cudaSgetri(n, dA, pivots);

cublasGetMatrix(n, n, sizeof(float), dA, n, A, n);





** cudaSgetrf

** Performs an in-place LU factorization on a square matrix

** Uses the unblocked BLAS2 approach


int *

cudaSgetrf(unsigned int n, float *dA)


int i, pivot, *pivots;

float *offset, factor;

pivots = (int *) calloc(n, sizeof(int));

for(i = 0; i < n; ++i)

pivots[i] = i;

for(i = 0; i < n - 1; i++) {

offset = dA + i*n + i;

pivot = i - 1 + cublasIsamax(n - i, offset, 1);

if(pivot != i) {

  pivots[i] = pivot;

  cublasSswap(n, dA + pivot, n, dA + i, n);


cublasGetVector(1, sizeof(float), offset, 1, &factor, 1);

cublasSscal(n - i - 1, 1 / factor, offset + 1, 1);

cublasSger(n - i - 1, n - i - 1, -1.0f, offset + 1, 1, offset + n, n, offset + n + 1, n);


return pivots;



** cudaSgetri

** Computes the inverse of an LU-factorized square matrix



cudaSgetri(unsigned int n, float *dA, int *pivots)


int i;

float *dWork, *offset;

// Perform inv(U)

cudaStrtri(n, dA);

// Solve inv(A)*L = inv(U)

cublasAlloc(n - 1, sizeof(float), (void**)&dWork);

for(i = n - 1; i > 0; --i) {

offset = dA + (i - 1)*n + i;

cudaMemcpy(dWork, offset, (n - 1) * sizeof(float), cudaMemcpyDeviceToDevice); 

cublasSscal(n - i, 0, offset, 1);

cublasSgemv('n', n, n - i, -1.0f, dA + i*n, n, dWork, 1, 1.0f, dA + (i-1)*n, 1);



// Pivot back to original order

for(i = n - 1; i >= 0; --i)

if(i != pivots[i])

  cublasSswap(n, dA + i*n, 1, dA + pivots[i]*n, 1);



** cudaStrtri

** Computes the inverse of an upper triangular matrix in place

** Uses the unblocked BLAS2 approach



cudaStrtri(unsigned int n, float *dA)


int i;

float factor, *offset;

for(i = 0; i < n; ++i) {

offset = dA + i*n;

cublasGetVector(1, sizeof(float), offset + i, 1, &factor, 1);

factor = 1 / factor;

cublasSetVector(1, sizeof(float), &factor, 1, offset + i, 1);  

cublasStrmv(‘u’, ‘n’, ‘n’, i, dA, n, offset, 1);

cublasSscal(i, -1 * factor, offset, 1);




Hi Byron,

thank you very much for sharing that piece of code. I have tested it and it works fine. Have you done the same kind of work to invert a complex matrix ? If you have, would you mind share the code please ? Thanks in advance.



You might also want to check out this thread: