Does anyone have plans to produce generic, optimized linear algebra solvers using CUDA (think of some of the solvers in IMSL)? For example, a tridiagonal solver, a sparse matrix solver, or an eigenvalue solver would be very useful for a large class of problems.

The CUBLAS library is a great step in the right direction. :)
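
To make the tridiagonal request concrete, here is a minimal CPU reference sketch of the classic Thomas algorithm in plain Python. It is only an illustration of what such a solver computes, not a CUDA kernel and not anything from IMSL or CUBLAS; the function name `solve_tridiagonal` and the storage convention (`lower[i]` holds A[i][i-1], `upper[i]` holds A[i][i+1]) are assumptions made for this example. It does no pivoting, so it assumes a well-behaved (for example, diagonally dominant) matrix.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve A x = rhs for tridiagonal A using the Thomas algorithm.

    lower[i] = A[i][i-1] (lower[0] is unused)
    diag[i]  = A[i][i]
    upper[i] = A[i][i+1] (upper[n-1] is unused)
    No pivoting: assumes A is, e.g., diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the sub-diagonal.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Example: A = [[2,-1,0],[-1,2,-1],[0,-1,2]], built so that x = [1, 2, 3].
lower = [0.0, -1.0, -1.0]
diag = [2.0, 2.0, 2.0]
upper = [-1.0, -1.0, 0.0]
rhs = [0.0, 0.0, 4.0]
x_tri = solve_tridiagonal(lower, diag, upper, rhs)
```

The forward sweep is inherently sequential, which is why GPU versions usually swap in a different scheme such as cyclic reduction rather than porting this loop directly.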

In fact, I am considering developing a sparse matrix package. I am looking into SPARSE1.4 and UMFPACK5.1, both of which I am familiar with. I believe the availability of a sparse matrix solver is pivotal to further exploitation of GPU hardware acceleration in the scientific computing community.

CUBLAS 2.0 should be released soon, which may have some more methods in it that could be used to build a linear algebra solver. As for specifically-optimized kernels (e.g. tridiagonal, sparse, etc.), I don’t think there is anything official coming from nVidia, but I know that some people are working on their own implementations of such kernels (like myself) – though I don’t think there’s anything out to the public yet.

I am planning to write a conjugate gradient solver (Krylov type) … plus other solvers in CUDA. Can someone let me know if this kind of work has been done previously?

Yes, there have been a few attempts in that direction. Usually the tricky bits are the matrix-vector multiply and the large vector dot product, both of which have nice implementations in the CUDA demos and in CUDPP. We have our own internal solver written completely in CUDA with a very nice performance boost, so it can work. Though I can’t share it with you guys … sorry :)
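
For readers new to this, a minimal CPU reference sketch of those two building blocks follows, in plain Python. It assumes the matrix is stored in CSR (compressed sparse row) format; the names `csr_matvec` and `dot` are made up for this example and are not taken from the CUDA demos or CUDPP. On a GPU the outer row loop would typically be spread across threads and the dot product done as a parallel reduction.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Compute y = A x for a sparse matrix A stored in CSR format.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx and values hold their column indices and values.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):  # each row is independent work
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def dot(a, b):
    """Dot product; on a GPU this would be a parallel reduction."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Example: A = [[2,-1,0],[-1,2,-1],[0,-1,2]] in CSR form.
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
y = csr_matvec(row_ptr, col_idx, values, x)
xy = dot(x, y)
```

Rows with very different nonzero counts leave some threads idle in a one-thread-per-row mapping, which is part of what makes the GPU version tricky.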

It’s very fast, but suffers from the obvious problem: no preconditioners! This is the next step, but I’m not sure what type of preconditioners I might be able to compute quickly enough (using, of course, the GPU). My favorite is ILUT, but can it be thread-parallelized?

Let me know, I have no problem providing the code (I think you can email using the board, eh?)

This would be a great project that I would like to collaborate on: iterative solvers and preconditioners. Sort of the AztecOO of CUDA.

I don’t have time to generate a lot of performance data; perhaps you could contribute some? I’d be happy to post results at the Google Code site!

My preliminary tests (and the code will show this) indicate that the matrix multiply is (on my GeForce 9800M card) 369 times faster than on the CPU. Hence each iteration proceeds very quickly.

The downside is that without preconditioning, there isn’t yet a win, because it takes so many iterations to solve anything!!

My hope is to generate some interest and maybe get some people helping out and writing some preconditioners.
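
To illustrate why the iteration count matters so much, here is a small CPU sketch in plain Python comparing plain conjugate gradient against CG with a Jacobi (diagonal) preconditioner. Jacobi is swapped in here only because it is trivially parallel; it is not ILUT and it is not the code discussed in this thread. The test matrix (a tridiagonal with diagonal 1..50 and off-diagonals of -0.1) and all function names are assumptions chosen for the example.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def tridiag_matvec(diag, off, x):
    """y = A x, where A has main diagonal `diag` and constant off-diagonal `off`."""
    n = len(x)
    y = [diag[i] * x[i] for i in range(n)]
    for i in range(n - 1):
        y[i] += off * x[i + 1]
        y[i + 1] += off * x[i]
    return y


def pcg(matvec, b, inv_diag=None, tol=1e-10, max_iters=1000):
    """Conjugate gradient; applies Jacobi preconditioning when inv_diag is given.

    Returns (x, iterations). Stops when ||r|| <= tol * ||b|| or at max_iters.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual for the zero initial guess
    z = r if inv_diag is None else [inv_diag[i] * r[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for it in range(1, max_iters + 1):
        ap = matvec(p)
        alpha = rz / dot(p, ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        z = r if inv_diag is None else [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iters


n = 50
diag = [1.0 + i for i in range(n)]  # badly scaled diagonal: 1, 2, ..., 50
off = -0.1                          # weak coupling keeps A symmetric positive definite
b = [1.0] * n


def matvec(v):
    return tridiag_matvec(diag, off, v)


x_cg, iters_cg = pcg(matvec, b)
x_pcg, iters_pcg = pcg(matvec, b, inv_diag=[1.0 / d for d in diag])
print("plain CG iterations: ", iters_cg)
print("Jacobi PCG iterations:", iters_pcg)
```

Applying the Jacobi preconditioner is just an elementwise multiply, so it costs almost nothing per iteration on a GPU; the open question is how much it helps on real matrices, where something stronger like ILUT may still be needed.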

Thank you for the info. The matrix I’m working on is relatively small and dense, so I’m still trying to figure out whether I should focus on a direct solver or try an iterative solver. The speedup you showed is quite impressive, and you are right that the preconditioner will be very important.
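
For the small dense case, a direct solve is easy to sketch. The following is a minimal CPU reference in plain Python using Gaussian elimination with partial pivoting; the function name `solve_dense` and the 3x3 example system are assumptions for illustration only, not a recommendation of a particular GPU approach.

```python
def solve_dense(a, b):
    """Solve A x = b for a small dense A by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [A | b] so the inputs are left untouched.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from the rows below.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


# Example system built so that the exact solution is x = [1, -1, 2].
a = [[4.0, 1.0, 2.0],
     [1.0, 3.0, 0.0],
     [2.0, 0.0, 5.0]]
b = [7.0, -2.0, 12.0]
x_dense = solve_dense(a, b)
```

A direct solve costs on the order of n^3 operations but has no iteration count to worry about, which is the usual argument for it when the matrix is small and dense.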

I’ve updated the code to be slightly less ugly, including checks for convergence, max iters, etc…
I also uploaded a smaller matrix for testing. On to working on ILUT…