Gaussian elimination using CUBLAS

I am trying to implement the following loop in CUDA:

for i = 1 to n-1
    A(i+1:n, i)     = A(i+1:n, i) * ( 1 / A(i,i) )                        ... BLAS 1 (scale a vector)
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)         ... BLAS 2 (rank-1 update)

What I have done so far is this:
for (i = 0; i < n; i++)
{
    p = 1 / a_h[i*(n+1)+i];
    cublasSscal(n, p, &a_d[(i+1)*n+i], sizeof(float));
}
Now, for the rank-1 update I think I will have to use the sger function, but I don't know how to use it; the arguments are getting out of control. I have the whole matrix in a single array *a. Can someone help me with this?

I apologize if this question is too easy, as I am a beginner.

I’m not sure whether you’re implementing this just to learn CUDA or whether you have a specific problem you need to solve, but if you don’t mind using someone else’s implementation, vvolkov posted code for an LU factorization in the Development forum (which will serve the same purpose as Gaussian elimination).

The arguments in your sscal() call are almost certainly not what you want. I would recommend getting yourself a host CPU BLAS implementation and familiarizing yourself with BLAS first (and maybe the underlying matrix mathematics), then coding a host CPU version, verifying it works, and only then worrying about a CUDA implementation. You need to learn to walk before you can run…
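That said, just to give you an idea of the shape of the calls you are after, one elimination step with the legacy CUBLAS API would look roughly like the sketch below. This is untested and rests on assumptions your post doesn’t state: cublas.h is included and CUBLAS is initialized, the matrix already lives on the device in a_d (e.g. via cublasSetMatrix), it is stored column-major (which CUBLAS expects) with leading dimension n, and there is no pivoting.

/* Untested sketch: in-place elimination (no pivoting) on an n x n matrix
 * stored column-major on the device in a_d with leading dimension n. */
for (int i = 0; i < n - 1; ++i) {
    float pivot;

    /* fetch the current pivot A(i,i) back from the device */
    cublasGetVector(1, sizeof(float), &a_d[i + i*n], 1, &pivot, 1);

    /* A(i+1:n, i) *= 1/A(i,i)  --  BLAS 1 scale of the column below the pivot */
    cublasSscal(n - i - 1, 1.0f / pivot, &a_d[(i+1) + i*n], 1);

    /* A(i+1:n, i+1:n) -= A(i+1:n, i) * A(i, i+1:n)  --  BLAS 2 rank-1 update */
    cublasSger(n - i - 1, n - i - 1, -1.0f,
               &a_d[(i+1) + i*n], 1,          /* column vector, stride 1 */
               &a_d[i + (i+1)*n], n,          /* row vector, stride n    */
               &a_d[(i+1) + (i+1)*n], n);     /* trailing block, lda = n */
}

Note that the last argument to sscal is the stride between elements, not a size in bytes, and that the pivot has to be read back (or handled on the device) each iteration: reading it from a stale host copy will go wrong as soon as the rank-1 updates start changing the later diagonal entries.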

You may also want to take a look at the LUGPU project: http://www.cs.unc.edu/~geom/LUGPULIB/

And my own question:

Is there any LU/ILUx implementation for the GPU for sparse matrices? It seems it would be quite inefficient in the sparse case, right?

Parallel sparse algorithms aren’t easy, and it is pretty obvious that the lack of predictable memory access patterns is a killer on a cache-starved architecture with such huge global memory access penalties. There have been a couple of papers on doing sparse LU factorizations on GPUs, but I haven’t seen much useful code in the wild.

My current (and very suboptimal) approach has been to use a hacked version of SuperLU, using a mixture of host and GPU BLAS and tuning the solver blocking and panel factors to wring as much out of it as possible. I am sure there must be a better way, but I am going to have to wait for someone else to think of it :)

Thanks! Even this may be a good idea! Can you share the code, or is it private?

A typical sparse factorization can be seen as a sequence of dense submatrix computations associated with a tree (called a supernodal tree). The computation is performed by a post-order tree traversal. One idea for adapting this to the GPU is to allocate all of the dense matrix storage in device memory at once and to adapt the data movement on the GPU to the tree traversal. This cannot be done with simple hacking; a good understanding of the sparse factorization at the source-code level is definitely necessary.
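To make the traversal part concrete, here is a toy sketch of the structure I mean. The struct and function names are made up and nothing here comes from a real solver; the dense work per supernode (which in practice would be CUBLAS calls on blocks carved out of one large device allocation) is stubbed out.

#include <stdio.h>

/* Hypothetical supernode: each node owns a dense block of the factor. */
typedef struct supernode {
    int id;
    int nchildren;
    struct supernode **children;
    float *dense_block;   /* would point into one big device allocation */
    int rows, cols;
} supernode;

/* Placeholder for the dense factorization/update of one supernode. */
static void factor_dense_block(supernode *s)
{
    printf("factoring supernode %d (%d x %d)\n", s->id, s->rows, s->cols);
}

/* Post-order traversal: children contribute their updates first,
 * then the current node's dense block is factored. */
static void factor_postorder(supernode *s)
{
    for (int c = 0; c < s->nchildren; ++c)
        factor_postorder(s->children[c]);
    factor_dense_block(s);
}

int main(void)
{
    supernode leaf1 = { 1, 0, NULL, NULL, 4, 4 };
    supernode leaf2 = { 2, 0, NULL, NULL, 8, 8 };
    supernode *kids[] = { &leaf1, &leaf2 };
    supernode root   = { 3, 2, kids, NULL, 16, 16 };
    factor_postorder(&root);   /* prints 1, 2, then 3 */
    return 0;
}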

It is more of a build than a code fork. Basically I have made my own custom BLAS which reimplements a few key functions (basically SGEMV and DGEMV) with my own wrappers, and those contain a simple size test to select whether to pass the call to the CPU or the GPU BLAS. There is then a single routine in SuperLU (sp_ienv) which returns a set of “optimal” tuning parameters. I have done some very limited playing around to see what works best on some typical matrices which crop up in our problems. The standard build system requires some changes to make the whole contraption hold together, but it does work.

Unfortunately the key part (the BLAS) contains code that I don’t have a license to redistribute, so you will have to use your imagination. But it certainly isn’t difficult to get going, at least under Linux, which is my only CUDA platform at this stage.
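Just to sketch the general shape, though: something like the wrapper below (illustrative only, not the code I actually use; the name, the crossover threshold, and the data movement are all made up for the example, and a real version would keep the operands resident on the device rather than copying on every call).

#include <stdlib.h>
#include <cublas.h>

/* Host BLAS (Fortran) SGEMV, e.g. from the reference BLAS or an optimized CPU BLAS. */
extern void sgemv_(const char *trans, const int *m, const int *n,
                   const float *alpha, const float *a, const int *lda,
                   const float *x, const int *incx,
                   const float *beta, float *y, const int *incy);

/* Illustrative crossover size below which the GPU is not worth the copies. */
#define GPU_CROSSOVER 1024

/* Same calling sequence as the host SGEMV: small problems go to the CPU BLAS,
 * large ones are copied to the device and run through legacy CUBLAS. */
void my_sgemv(char trans, int m, int n, float alpha,
              const float *a, int lda, const float *x, int incx,
              float beta, float *y, int incy)
{
    if (m < GPU_CROSSOVER || n < GPU_CROSSOVER) {
        sgemv_(&trans, &m, &n, &alpha, a, &lda, x, &incx, &beta, y, &incy);
        return;
    }

    int xlen = (trans == 'N' || trans == 'n') ? n : m;
    int ylen = (trans == 'N' || trans == 'n') ? m : n;
    float *d_a, *d_x, *d_y;

    cublasAlloc(lda * n, sizeof(float), (void **)&d_a);
    cublasAlloc(xlen,    sizeof(float), (void **)&d_x);
    cublasAlloc(ylen,    sizeof(float), (void **)&d_y);

    cublasSetMatrix(m, n, sizeof(float), a, lda, d_a, lda);
    cublasSetVector(xlen, sizeof(float), x, incx, d_x, 1);
    cublasSetVector(ylen, sizeof(float), y, incy, d_y, 1);   /* needed when beta != 0 */

    cublasSgemv(trans, m, n, alpha, d_a, lda, d_x, 1, beta, d_y, 1);

    cublasGetVector(ylen, sizeof(float), d_y, 1, y, incy);

    cublasFree(d_a);
    cublasFree(d_x);
    cublasFree(d_y);
}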

Absolutely. The key part is going to be getting the tree traversal to work efficiently on the GPU, and deciding whether a left-looking or right-looking algorithm will work better. That requires more chops than a humble materials scientist like myself possesses, I am afraid.