Gaussian elimination using CUBLAS

I am trying to implement the following loop in CUDA:

for i = 1 to n-1
    A(i+1:n, i)     = A(i+1:n, i) * ( 1 / A(i,i) )                        ... BLAS 1 (scale a vector)
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)         ... BLAS 2 (rank-1 update)

What I have done so far is this:
for (i = 0; i < n; i++)
{
    p = 1 / a_h[i*(n+1)+i];
    cublasSscal(n, p, &a_d[(i+1)*n+i], sizeof(float));
}
Now, for the rank-1 update I think I will have to use the sger function, but I don't know how to use it; the arguments are getting out of control. I have the whole matrix in a single array *a. Can someone help me with this?

I apologize if this question is too easy, as I am a beginner.

I’m not sure whether you’re implementing this just to learn CUDA or whether you have a specific problem you need to solve, but if you don’t mind using someone else’s implementation, vvolkov posted code for an LU factorization in the Development forum (which will serve the same purpose as Gaussian elimination).

The arguments in your sscal() call are almost certainly not what you want. I would recommend getting yourself a host CPU BLAS implementation and familiarizing yourself with BLAS first (and maybe the underlying matrix mathematics), then coding a host CPU version, verifying it works, and only then worrying about a CUDA implementation. You need to learn to walk before you can run…
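That said, just to give you an idea of the shape of the calls you are after, one elimination step with the legacy CUBLAS API would look roughly like the sketch below. This is untested and rests on assumptions your post doesn’t state: cublas.h is included and CUBLAS is initialized, the matrix already lives on the device in a_d (e.g. via cublasSetMatrix), it is stored column-major (which CUBLAS expects) with leading dimension n, and there is no pivoting.

/* Untested sketch: in-place elimination (no pivoting) on an n x n matrix
 * stored column-major on the device in a_d with leading dimension n. */
for (int i = 0; i < n - 1; ++i) {
    float pivot;

    /* fetch the current pivot A(i,i) back from the device */
    cublasGetVector(1, sizeof(float), &a_d[i + i*n], 1, &pivot, 1);

    /* A(i+1:n, i) *= 1/A(i,i)  --  BLAS 1 scale of the column below the pivot */
    cublasSscal(n - i - 1, 1.0f / pivot, &a_d[(i+1) + i*n], 1);

    /* A(i+1:n, i+1:n) -= A(i+1:n, i) * A(i, i+1:n)  --  BLAS 2 rank-1 update */
    cublasSger(n - i - 1, n - i - 1, -1.0f,
               &a_d[(i+1) + i*n], 1,          /* column vector, stride 1 */
               &a_d[i + (i+1)*n], n,          /* row vector, stride n    */
               &a_d[(i+1) + (i+1)*n], n);     /* trailing block, lda = n */
}

Note that the last argument to sscal is the stride between elements, not a size in bytes, and that the pivot has to be read back (or handled on the device) each iteration: reading it from a stale host copy will go wrong as soon as the rank-1 updates start changing the later diagonal entries.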

You may also want to take a look at the LUGPU project: http://www.cs.unc.edu/~geom/LUGPULIB/

And my own question:

Is there any LU/ILUx implementation for the GPU for sparse matrices? It seems it would be quite inefficient in the sparse case, right?

Parallel sparse algorithms aren’t easy, and it is pretty obvious that the lack of predictable memory access patterns is a killer on a cache-starved architecture with such huge global memory access penalties. There have been a couple of papers on doing sparse LU factorizations on GPUs, but I haven’t seen much useful code in the wild.

My current (and very suboptimal) approach has been to use a hacked version of SuperLU, using a mixture of host and GPU BLAS and tuning the solver blocking and panel factors to wring as much out of it as possible. I am sure there must be a better way, but I am going to have to wait for someone else to think of it :)

Thanks! Even this may be a good idea! Can you share the code, or is it private?

A typical sparse factorization can be seen as a sequence of dense submatrix computations associated with a tree (called a supernodal tree). The computation is performed by a post-order tree traversal. One idea for adapting this to the GPU is to allocate all of the dense matrix storage in device memory at once and to adapt the data movement on the GPU to the tree traversal. This cannot be done with simple hacking; a good understanding of the sparse factorization at the source-code level is definitely necessary.
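To make the traversal part concrete, here is a toy sketch of the structure I mean. The struct and function names are made up and nothing here comes from a real solver; the dense work per supernode (which in practice would be CUBLAS calls on blocks carved out of one large device allocation) is stubbed out.

#include <stdio.h>

/* Hypothetical supernode: each node owns a dense block of the factor. */
typedef struct supernode {
    int id;
    int nchildren;
    struct supernode **children;
    float *dense_block;   /* would point into one big device allocation */
    int rows, cols;
} supernode;

/* Placeholder for the dense factorization/update of one supernode. */
static void factor_dense_block(supernode *s)
{
    printf("factoring supernode %d (%d x %d)\n", s->id, s->rows, s->cols);
}

/* Post-order traversal: children contribute their updates first,
 * then the current node's dense block is factored. */
static void factor_postorder(supernode *s)
{
    for (int c = 0; c < s->nchildren; ++c)
        factor_postorder(s->children[c]);
    factor_dense_block(s);
}

int main(void)
{
    supernode leaf1 = { 1, 0, NULL, NULL, 4, 4 };
    supernode leaf2 = { 2, 0, NULL, NULL, 8, 8 };
    supernode *kids[] = { &leaf1, &leaf2 };
    supernode root   = { 3, 2, kids, NULL, 16, 16 };
    factor_postorder(&root);   /* prints 1, 2, then 3 */
    return 0;
}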

It is more of a build than a code fork. Basically I have made my own custom BLAS which reimplements a few key functions (basically SGEMV and DGEMV) with my own wrappers, and those contain a simple size test to select whether to pass the call to the CPU or the GPU BLAS. There is then a single routine in SuperLU (sp_ienv) which returns a set of “optimal” tuning parameters. I have done some very limited playing around to see what works best on some typical matrices which crop up in our problems. The standard build system requires some changes to make the whole contraption hold together, but it does work.

Unfortunately the key part (the BLAS) contains code that I don’t have a license to redistribute, so you will have to use your imagination. But it certainly isn’t difficult to get going, at least under Linux, which is my only CUDA platform at this stage.
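Just to sketch the general shape, though: something like the wrapper below (illustrative only, not the code I actually use; the name, the crossover threshold, and the data movement are all made up for the example, and a real version would keep the operands resident on the device rather than copying on every call).

#include <stdlib.h>
#include <cublas.h>

/* Host BLAS (Fortran) SGEMV, e.g. from the reference BLAS or an optimized CPU BLAS. */
extern void sgemv_(const char *trans, const int *m, const int *n,
                   const float *alpha, const float *a, const int *lda,
                   const float *x, const int *incx,
                   const float *beta, float *y, const int *incy);

/* Illustrative crossover size below which the GPU is not worth the copies. */
#define GPU_CROSSOVER 1024

/* Same calling sequence as the host SGEMV: small problems go to the CPU BLAS,
 * large ones are copied to the device and run through legacy CUBLAS. */
void my_sgemv(char trans, int m, int n, float alpha,
              const float *a, int lda, const float *x, int incx,
              float beta, float *y, int incy)
{
    if (m < GPU_CROSSOVER || n < GPU_CROSSOVER) {
        sgemv_(&trans, &m, &n, &alpha, a, &lda, x, &incx, &beta, y, &incy);
        return;
    }

    int xlen = (trans == 'N' || trans == 'n') ? n : m;
    int ylen = (trans == 'N' || trans == 'n') ? m : n;
    float *d_a, *d_x, *d_y;

    cublasAlloc(lda * n, sizeof(float), (void **)&d_a);
    cublasAlloc(xlen,    sizeof(float), (void **)&d_x);
    cublasAlloc(ylen,    sizeof(float), (void **)&d_y);

    cublasSetMatrix(m, n, sizeof(float), a, lda, d_a, lda);
    cublasSetVector(xlen, sizeof(float), x, incx, d_x, 1);
    cublasSetVector(ylen, sizeof(float), y, incy, d_y, 1);   /* needed when beta != 0 */

    cublasSgemv(trans, m, n, alpha, d_a, lda, d_x, 1, beta, d_y, 1);

    cublasGetVector(ylen, sizeof(float), d_y, 1, y, incy);

    cublasFree(d_a);
    cublasFree(d_x);
    cublasFree(d_y);
}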

Absolutely. The key part is going to be getting the tree traversal to work efficiently on the GPU, and deciding whether a left-looking or right-looking algorithm will work better. That requires more chops than a humble materials scientist like myself possesses, I am afraid.