CULAPACK

Does anybody know when, or if, NVIDIA will be releasing a LAPACK library for CUDA? Even a rough date would be greatly appreciated :D

I agree that having a CUDA LAPACK library would be nice, and I think it’s a common request on the wish list post and elsewhere. While I don’t have an answer to your question, if you only need a handful of the LAPACK routines, you could always just implement them yourself (rather than waiting for NVIDIA) the same way they are written in LAPACK, replacing the BLAS calls with CUBLAS calls. I ended up doing this for SGETRF and SGETRI in order to have a general matrix inversion routine that uses the GPU.

  • Byron

You could do that (replace BLAS calls with CUBLAS calls)…but to be honest, that way is terribly inefficient because it doesn’t take kernel launch overhead, differences in reading from device memory, or host ↔ device transfers into account. There are a few LAPACK routines that have been written specifically for CUDA and posted to the forum before that will get you much better performance. However, I still haven’t seen a good matrix inversion routine written/posted yet…so perhaps I will give it a shot at some point.

The ones I’ve seen specifically are QR/LU decompositions and some matrix multiplication stuff. Search the programming forum for posts by vvolkov, he does a lot of work with this kind of stuff, and maybe he’s posted something that will be useful to you.

In any case, I keep hearing that the new release of CUBLAS (with a bunch of new functions implemented) is going to be available “soon”…but it’s been a pretty long time and we’re still waiting. I hope that someone at NVIDIA makes this and CUFFT a priority soon, since having ready-to-use libraries with these functions would make CUDA attractive to many more developers.
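To make the launch-overhead point above concrete, here is a rough, untested micro-benchmark sketch (not from any of the posts in this thread) that issues the same total amount of Sscal work as one large call and as many small calls, using the legacy CUBLAS API and CUDA event timing. The vector size and chunk count are arbitrary, and the actual gap will depend on your GPU and driver.

[codebox]

/*
** Hypothetical micro-benchmark illustrating kernel launch overhead:
** the same total Sscal work is issued as one large call and as many
** small calls. Sizes are arbitrary.
*/
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas.h>

int main(void)
{
  const int n = 1 << 20;    /* total number of elements */
  const int chunks = 1024;  /* number of small launches */
  float *dX, ms;
  cudaEvent_t start, stop;
  int i;

  cublasInit();
  cublasAlloc(n, sizeof(float), (void**)&dX);
  cudaMemset(dX, 0, n * sizeof(float));
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  /* One large scal over the whole vector */
  cudaEventRecord(start, 0);
  cublasSscal(n, 2.0f, dX, 1);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&ms, start, stop);
  printf("1 call of %d elements: %8.3f ms\n", n, ms);

  /* The same work split across many small launches */
  cudaEventRecord(start, 0);
  for(i = 0; i < chunks; ++i)
    cublasSscal(n / chunks, 2.0f, dX + i * (n / chunks), 1);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&ms, start, stop);
  printf("%d calls of %d elements: %8.3f ms\n", chunks, n / chunks, ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cublasFree(dX);
  cublasShutdown();
  return 0;
}

[/codebox]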

You’re right in that this “naive” approach doesn’t take advantage of potential hardware-specific optimizations that could be achieved through clever algorithms. That said, I did minimize the amount of host ↔ device data transfer and actually achieved reasonable performance for the domain I’m working in. The largest inversion I’ve looked at is 1000x1000, which isn’t that big, relatively speaking. I still need to do some analysis to see whether using straight LAPACK wouldn’t just be faster. I’ll clean up my code and post it later for people to take a look at as well.

  • Byron

Hi!

While waiting for a good matrix inversion CUDA kernel, would you mind sharing your work, please? As far as I am concerned, I am limited by host ↔ device data transfers, and maybe your code could help me. Thank you very much in advance for your help.

Best Regards,

Mathieu

Hmm, ok. Well, is there any place where people can (or currently are) sharing files for CULAPACK? Because if so, I would like to submit some of the functions that I have done manually.

Basically, does NVIDIA do something like GNU, where code can be submitted by the various people who are interested in the project? That way we could build our own LAPACK ourselves :P

sachin

I’ve seen a lot of people just attach the code/makefiles/tools they want to share to their forum posts…it seems like that would work alright for this too, especially if it were kept to one thread.

Ok, here’s my LAPACK-inspired, CUBLAS-based approach to inverting a square matrix.

  • Byron

[codebox]

/*
** Simple matrix inversion routine using CUDA
**
** This reimplements the LAPACK sgetri and associated required routines
** by replacing the BLAS calls with CUBLAS calls.
**
** Byron Galbraith
** Department of Mathematics, Statistics, and Computer Science
** Marquette University
** 2009-04-30
*/
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas.h>

// Prototypes
int* cudaSgetrf(unsigned int n, float *dA);
void cudaSgetri(unsigned int n, float *dA, int *pivots);
void cudaStrtri(unsigned int n, float *dA);

/*
** cudaInvertMatrix
** Inverts a square matrix in place
** n - matrix dimension
** A - pointer to array of floats representing the matrix in column-major order
*/
extern "C" void
cudaInvertMatrix(unsigned int n, float *A)
{
  int *pivots;
  float *dA;

  cublasInit();
  cublasAlloc(n * n, sizeof(float), (void**)&dA);
  cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);

  // Perform LU factorization
  pivots = cudaSgetrf(n, dA);

  // Perform inversion on factorized matrix
  cudaSgetri(n, dA, pivots);

  cublasGetMatrix(n, n, sizeof(float), dA, n, A, n);

  free(pivots);
  cublasFree(dA);
  cublasShutdown();
}

/*
** cudaSgetrf
** Performs an in-place LU factorization on a square matrix
** Uses the unblocked BLAS2 approach
*/
int *
cudaSgetrf(unsigned int n, float *dA)
{
  int i, pivot, *pivots;
  float *offset, factor;

  pivots = (int *) calloc(n, sizeof(int));
  for(i = 0; i < n; ++i)
    pivots[i] = i;

  for(i = 0; i < n - 1; i++) {
    offset = dA + i*n + i;

    // Find the pivot row (cublasIsamax returns a 1-based index)
    pivot = i - 1 + cublasIsamax(n - i, offset, 1);
    if(pivot != i) {
      pivots[i] = pivot;
      cublasSswap(n, dA + pivot, n, dA + i, n);
    }

    // Scale the column below the diagonal, then rank-1 update the trailing submatrix
    cublasGetVector(1, sizeof(float), offset, 1, &factor, 1);
    cublasSscal(n - i - 1, 1 / factor, offset + 1, 1);
    cublasSger(n - i - 1, n - i - 1, -1.0f, offset + 1, 1, offset + n, n, offset + n + 1, n);
  }

  return pivots;
}

/*
** cudaSgetri
** Computes the inverse of an LU-factorized square matrix
*/
void
cudaSgetri(unsigned int n, float *dA, int *pivots)
{
  int i;
  float *dWork, *offset;

  // Perform inv(U)
  cudaStrtri(n, dA);

  // Solve inv(A)*L = inv(U)
  cublasAlloc(n - 1, sizeof(float), (void**)&dWork);

  for(i = n - 1; i > 0; --i) {
    offset = dA + (i - 1)*n + i;
    cudaMemcpy(dWork, offset, (n - 1) * sizeof(float), cudaMemcpyDeviceToDevice);
    cublasSscal(n - i, 0, offset, 1);
    cublasSgemv('n', n, n - i, -1.0f, dA + i*n, n, dWork, 1, 1.0f, dA + (i-1)*n, 1);
  }

  cublasFree(dWork);

  // Pivot back to original order
  for(i = n - 1; i >= 0; --i)
    if(i != pivots[i])
      cublasSswap(n, dA + i*n, 1, dA + pivots[i]*n, 1);
}

/*
** cudaStrtri
** Computes the inverse of an upper triangular matrix in place
** Uses the unblocked BLAS2 approach
*/
void
cudaStrtri(unsigned int n, float *dA)
{
  int i;
  float factor, *offset;

  for(i = 0; i < n; ++i) {
    offset = dA + i*n;

    // Invert the diagonal element on the host
    cublasGetVector(1, sizeof(float), offset + i, 1, &factor, 1);
    factor = 1 / factor;
    cublasSetVector(1, sizeof(float), &factor, 1, offset + i, 1);

    // Update the column above the diagonal
    cublasStrmv('u', 'n', 'n', i, dA, n, offset, 1);
    cublasSscal(i, -1 * factor, offset, 1);
  }
}

[/codebox]
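In case it helps anyone trying the code out, here is a minimal, hypothetical test driver (not part of Byron’s post): it inverts a small column-major matrix in place and prints the result. It assumes the file is compiled together with the code above using nvcc and linked against CUBLAS (e.g. -lcublas).

[codebox]

/*
** Hypothetical test driver for cudaInvertMatrix (not part of the
** original post). Inverts a small matrix in place and prints it.
*/
#include <stdio.h>

extern "C" void cudaInvertMatrix(unsigned int n, float *A);

int main(void)
{
  const unsigned int n = 3;
  unsigned int i, j;

  /* 3x3 test matrix in column-major order */
  float A[9] = { 4.0f, 2.0f, 0.0f,    /* column 0 */
                 2.0f, 5.0f, 1.0f,    /* column 1 */
                 0.0f, 1.0f, 3.0f };  /* column 2 */

  cudaInvertMatrix(n, A);  /* A now holds inv(A) */

  for(i = 0; i < n; ++i) {
    for(j = 0; j < n; ++j)
      printf("%8.4f ", A[j * n + i]);  /* element (i, j), column-major */
    printf("\n");
  }
  return 0;
}

[/codebox]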

Hi Byron,

Thank you very much for sharing that piece of code. I have tested it and it works fine. Have you done the same kind of work to invert a complex matrix? If you have, would you mind sharing the code, please? Thanks in advance.

Regards,

Mathieu
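For anyone who wants to experiment with the complex case in the meantime, below is a rough, untested sketch (not from any post in this thread) of how the triangular-inversion step might look for single-precision complex data. It assumes the complex CUBLAS routines used here (cublasCtrmv, cublasCscal) are available in your CUBLAS release and relies on the cuComplex helpers from cuComplex.h; the LU and getri steps would additionally need the analogous Icamax/Cswap/Cgeru/Cgemv substitutions, which are not shown.

[codebox]

/*
** Hypothetical complex analogue of cudaStrtri (untested sketch).
** Assumes the complex CUBLAS Level-1/2 routines are available.
*/
#include <cublas.h>
#include <cuComplex.h>

void
cudaCtrtri(unsigned int n, cuComplex *dA)
{
  unsigned int i;
  cuComplex factor, negFactor, *offset;
  const cuComplex one = make_cuFloatComplex(1.0f, 0.0f);

  for(i = 0; i < n; ++i) {
    offset = dA + i*n;

    /* Invert the diagonal element on the host (complex division) */
    cublasGetVector(1, sizeof(cuComplex), offset + i, 1, &factor, 1);
    factor = cuCdivf(one, factor);
    cublasSetVector(1, sizeof(cuComplex), &factor, 1, offset + i, 1);

    /* Update the column above the diagonal, as in the real version */
    cublasCtrmv('u', 'n', 'n', i, dA, n, offset, 1);
    negFactor = make_cuFloatComplex(-cuCrealf(factor), -cuCimagf(factor));
    cublasCscal(i, negFactor, offset, 1);
  }
}

[/codebox]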

You might also want to check out this thread:

[url=“http://forums.nvidia.com/index.php?showtopic=89084”]http://forums.nvidia.com/index.php?showtopic=89084[/url]

Joel