Linear Algebra Solvers

Does anyone have any plans to produce generic, optimized linear algebra solvers using CUDA (think of some of the solvers in IMSL)? For example, a tridiagonal solver, a sparse matrix solver, or an eigenvalue solver would be very useful to a large class of problems.

The CUBLAS library is a great step in the right direction. :)

Yes, I agree, and I am searching for a CUDA implementation of a linear system solver… Have you made any progress? I’ll let you know about mine.

In fact, I am considering developing a sparse matrix package. I am looking into SPARSE 1.4 and UMFPACK 5.1, which I am familiar with. I believe the availability of a sparse matrix solver is pivotal to the further adoption of GPU hardware acceleration in the scientific computing community.

Just wanted to push the topic a bit.

Something like the LAPACK routines for linear equations is at the top of my wish list…

(btw, thx for CUBLAS!)


You can find some dense linear algebra results in the CUDA showcase.

Thx, I must have overlooked some of the papers in the showcase.

Do you have any thoughts on sparse matrix iterative solvers?

I am also interested in iterative solvers, especially Krylov subspace methods.

CUBLAS 2.0 should be released soon, and it may have some more methods that could be used to build a linear algebra solver. As for specifically-optimized kernels (e.g. tridiagonal, sparse, etc.), I don’t think there is anything official coming from NVIDIA, but I know that some people are working on their own implementations of such kernels (like myself) – though I don’t think there’s anything out to the public yet.

Shameless plug:

I am planning to implement a CONJUGATE GRADIENT solver (a Krylov-type method), plus other solvers, in CUDA. Can someone let me know whether this kind of work has been done previously?

Thanks all

Yes, there have been a few tries in that direction. Usually the tricky bits are the matrix-vector multiply and the large vector dot product, both of which have nice implementations in the CUDA demos and in CUDPP. We have our own internal solver written completely in CUDA with a very nice performance boost, so it can work. Though I can’t share it with you guys … sorry :)
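For anyone attacking that sparse matrix-vector multiply, the usual starting point is CSR storage with one row per thread (or per warp). Below is a generic serial C sketch of the CSR kernel's logic — not the internal solver mentioned above, just the standard scheme; on the GPU the outer loop over rows is what gets parallelized:

```c
/* CSR sparse matrix-vector product y = A*x.
 * row_ptr has n_rows+1 entries; the nonzeros of row i live in
 * val[row_ptr[i] .. row_ptr[i+1]-1] with column indices in col_idx.
 * On the GPU, each iteration of the outer loop maps to one thread/warp. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int row = 0; row < n_rows; row++) {
        double sum = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[row] = sum;
    }
}
```

The one-thread-per-row version is simple but suffers uncoalesced loads; the warp-per-row variant (with a small shared-memory reduction) is the usual next step.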

Okay, thanks for that info :thumbup: … Can you shed some light on the parallel strategy you used, or is that also restricted? :thumbsdown: :/


I just finished up my GMRES solver this weekend.

It’s very fast, but it suffers from the obvious problem: no preconditioners!!! That’s the next step, but I’m not sure what type of preconditioner I could compute quickly enough (using, of course, the GPU). My favorite is ILUT, but can this be thread-parallelized???

Let me know, I have no problem providing the code (I think you can email using the board, eh?)

This would be a great project that I would like to collaborate on: iterative solvers and preconditioners. Sort of an AztecOO for CUDA.
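On the thread-parallelism question: ILUT's triangular solves are inherently sequential along the elimination order, which is why it's hard on the GPU. A Jacobi (diagonal) preconditioner is the trivially parallel fallback — every entry is independent, one thread per element. A minimal CSR-based sketch (a hypothetical helper for illustration, not code from the GMRES project):

```c
/* Jacobi (diagonal) preconditioner: z = M^{-1} r with M = diag(A).
 * A is in CSR form. Each z[i] depends only on row i, so this maps
 * directly onto one GPU thread per element -- unlike ILUT's
 * sequential triangular solves. Hypothetical sketch, not from the
 * posted solver. */
void jacobi_precondition(int n, const int *row_ptr, const int *col_idx,
                         const double *val, const double *r, double *z) {
    for (int i = 0; i < n; i++) {
        double d = 1.0;  /* fall back to identity if the diagonal is absent */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            if (col_idx[k] == i) { d = val[k]; break; }
        z[i] = r[i] / d;
    }
}
```

Jacobi helps far less than ILUT on hard problems, but it costs almost nothing per iteration and is a reasonable first preconditioner to wire into a GPU GMRES.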

You could attach the code to your post.
Nice picture, is that a separation in a diffuser?

GMRES code:

I created a google code project and added the solver.
It’s ugly, but it works and might be a start to something more???

It can be accessed by svn:

Non-members may check out a read-only working copy anonymously over HTTP.

svn checkout cudaztec-read-only

If anyone wants, I would be happy to add them and make this into something.

Let me know if there are problems with the svn checkout.

Thank you very much for sharing the code. Could you please give some performance data of the code?


Again, the code is very rough for the moment!

I don’t have time to generate a lot of performance data; perhaps you could contribute some? I’d be happy to post results at the google code site!

My preliminary tests (and the code will walk you through this) show that the matrix multiply is (on my GeForce 9800M card) 369 times faster than on the CPU. Hence the iteration proceeds very quickly.

The downside is that without preconditioning, there isn’t yet a win, because it takes so many iterations to solve anything!!

My hope is to generate some interest and maybe get some people helping out and writing some preconditioners???

And to talk about some design ideas…

So, please, let me know what you think!

Thank you for the info. The matrix I’m working with is relatively small and dense, so I’m still trying to figure out whether I should focus on a direct solver or try an iterative one. The speedup you showed is quite impressive, and you are right that the preconditioner will be very important.
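For a small dense system, the direct route is just Gaussian elimination with partial pivoting — the same factorization LAPACK's `getrf`/`getrs` pair performs. A compact serial sketch of what that solve involves (for scale comparison, not a tuned implementation):

```c
#include <math.h>

/* Solve A x = b by Gaussian elimination with partial pivoting.
 * A (row-major, n x n) and b are overwritten during elimination.
 * Returns 0 on success, -1 if a zero pivot is hit (singular A). */
int lu_solve(double *A, double *b, double *x, int n) {
    for (int k = 0; k < n; k++) {
        /* pick the largest pivot in column k */
        int piv = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i * n + k]) > fabs(A[piv * n + k])) piv = i;
        if (A[piv * n + k] == 0.0) return -1;
        if (piv != k) {  /* swap rows k and piv in A and b */
            for (int j = 0; j < n; j++) {
                double t = A[k * n + j];
                A[k * n + j] = A[piv * n + j];
                A[piv * n + j] = t;
            }
            double t = b[k]; b[k] = b[piv]; b[piv] = t;
        }
        /* eliminate below the pivot, updating b as we go */
        for (int i = k + 1; i < n; i++) {
            double m = A[i * n + k] / A[k * n + k];
            for (int j = k + 1; j < n; j++) A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    /* back substitution on the upper triangle */
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++) s -= A[i * n + j] * x[j];
        x[i] = s / A[i * n + i];
    }
    return 0;
}
```

For matrices small enough to fit comfortably in memory, this O(n³) solve is exact in one pass, so an iterative method only wins once the matrix gets large or the per-iteration cost (e.g. a GPU mat-vec) dominates.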

I’ve updated the code to be slightly less ugly, including checks for convergence, max iters, etc…
I also uploaded a smaller matrix for testing. On to working on ILUT…

(cuda GMRES solver)