i´m currently implementing a conjugate gradient solver for sparse linear systems on the gpu. And i´m quite happy with it´s performance (2.8 gflop / s - 20 times faster then CPU). The only problem is that i need many more iterations (on GPU with Jacobi Preconditioner) then on cpu (with ILU Preconditioner).

The problem is the following: for preconditioned cg i have to solve something like Cx=b in each step.
In case of a Jacobi preconditioner i can calculate x = C^-1 b easily.
In case of a ILU Preconditioner i could calculate LUx=b by performing a forward and a backward substitution.
But my Matrix only contains the main diagonal + 3 upper and 3 lower diagonals. Consequently for an optimal parallelisation i could get a speedup of factor 3. And that would destroy the performance of my whole algorithm (furthermore GPU without parallelisation is slower then cpu). So ILU on the GPU doesn’t make too much sense.

Did anybody implement anything like this or has good experiences with special parallel preconditioners (the calculation of the Preconditioner Matrix doesn’t have to be in parallel. Only the solving of Cx = b has to be in parallel).

I’m sorry to ask you a question and not being able to answer yours. I’m also trying to implement a CG solver with CUDA and I was wondering if you had some references that I could use to get me started. I understand how CG works, so I’m just looking for articles on the parallelization of things.