Parallel Preconditioners for CG calculating the "inverse" in parallel


i´m currently implementing a conjugate gradient solver for sparse linear systems on the gpu. And i´m quite happy with it´s performance (2.8 gflop / s - 20 times faster then CPU). The only problem is that i need many more iterations (on GPU with Jacobi Preconditioner) then on cpu (with ILU Preconditioner).

The problem is the following: for preconditioned cg i have to solve something like Cx=b in each step.
In case of a Jacobi preconditioner i can calculate x = C^-1 b easily.
In case of a ILU Preconditioner i could calculate LUx=b by performing a forward and a backward substitution.
But my Matrix only contains the main diagonal + 3 upper and 3 lower diagonals. Consequently for an optimal parallelisation i could get a speedup of factor 3. And that would destroy the performance of my whole algorithm (furthermore GPU without parallelisation is slower then cpu). So ILU on the GPU doesn’t make too much sense.

Did anybody implement anything like this or has good experiences with special parallel preconditioners (the calculation of the Preconditioner Matrix doesn’t have to be in parallel. Only the solving of Cx = b has to be in parallel).

Thanks for any ideas


I’m sorry to ask you a question and not being able to answer yours. I’m also trying to implement a CG solver with CUDA and I was wondering if you had some references that I could use to get me started. I understand how CG works, so I’m just looking for articles on the parallelization of things.

And which sparse storage scheme do you use?

Thanks, Bart

I haven’t worked with parallel preconditioners but seeing that you’re working with multi-diagonals, here is an idea.

If you have access to Matlab:

Load your matrix to the Matlab workspace.

In Matlab command line, type:

spy( your_matrix_name );

Enter. You’ll see the map of your matrix.

There is a command called symrcm (for Symmetric Reverse Cuthill-McKee).

Type in command line:

p = symrcm( your_matrix_name );

Press enter. Then type:

spy( p );

You’ll see that the resulting permutation vecotr is divided into two parts, which might be helpful for parallelizing.