Parallel ILU(0) preconditioning for the CG algorithm

Hi all,

Perhaps somebody can help me with this issue. First, I developed a CG solver for huge sparse matrices, and it works fine on my Quadro 5000 (0.8 s of compute time for 4.4 million nonzeros with a tolerance of 1e-3). After that I implemented an incomplete LU factorization with no fill-in (ILU(0)). That is a great preconditioner: it cuts the iteration count from 575 (CG only) to 86 (ILU(0) + CG). But the ILU is still sequential. I tried to parallelize it, but after reading a few papers it was clear that parallelizing ILU is not easy. So my next step was to split the problem into a few smaller problems (domain decomposition). I chose a block Jacobi decomposition, because each block lies on the diagonal of the matrix and so keeps its diagonal entries, which is important for the ILU algorithm (every diagonal element has to be nonzero). After implementing this, I tested the preconditioner on a quad-core CPU using the following procedure:

  1. Split my input matrix (CSR) and right-hand side into 4 block Jacobi matrices (also CSR).
  2. On each core, compute an ILU(0) decomposition of one block.
  3. On each core, run a PCG with one of the block matrices and its ILU preconditioner (PCG algorithm from Saad).
  4. After a few steps, combine the four solution vectors into one (just copying the results) and start a CG on the input matrix and right-hand side, using the combined vector from the 4 block solves as the starting solution.
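
To make step 2 concrete, here is a minimal sketch of ILU(0) on a single CSR block, using the row-wise (IKJ) ordering. It is an illustration only, not the code from this post: the function name `ilu0_csr` and its arguments are hypothetical, and it assumes that the column indices in each row are sorted and that every row stores its diagonal entry.

```python
def ilu0_csr(n, indptr, indices, data):
    """ILU(0) of an n x n CSR matrix (hypothetical helper, sorted columns assumed).

    Returns a new value array on the same sparsity pattern: the strict lower
    part holds L (unit diagonal implied), the diagonal and upper part hold U.
    """
    lu = list(data)                      # work on a copy, keep the input intact
    diag = [-1] * n                      # position of each diagonal entry
    for i in range(n):
        for p in range(indptr[i], indptr[i + 1]):
            if indices[p] == i:
                diag[i] = p
        if diag[i] < 0:
            raise ValueError(f"row {i} has no stored diagonal entry")

    for i in range(n):
        # column -> position lookup for the pattern of row i
        row = {indices[p]: p for p in range(indptr[i], indptr[i + 1])}
        for p in range(indptr[i], indptr[i + 1]):
            k = indices[p]
            if k >= i:
                break                    # only entries left of the diagonal
            if lu[diag[k]] == 0.0:
                raise ZeroDivisionError(f"zero pivot in row {k}")
            lu[p] /= lu[diag[k]]         # multiplier l_ik = a_ik / u_kk
            # subtract l_ik * u_kj, but only where row i already has an entry
            for q in range(diag[k] + 1, indptr[k + 1]):
                j = indices[q]
                if j in row:
                    lu[row[j]] -= lu[p] * lu[q]
    return lu


# Small tridiagonal example block (made-up values).
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
lu = ilu0_csr(3, indptr, indices, data)
```

The explicit diagonal lookup is where the "every diagonal element has to be nonzero" requirement shows up: a missing or zero pivot stops the factorization. Because the four blocks share no data, one such call per core can run independently.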

But the resulting solution is wrong, so I hope somebody has an idea. If a working solution can be found, I will share my code. If it works on a quad-core CPU, I will try to port it to the GPU.
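
For comparison with steps 1 to 4, here is a minimal sketch of how block Jacobi ILU(0) is commonly used as a preconditioner: one PCG runs on the full matrix, and in every iteration the per-block ILU(0) factors are applied to the corresponding slice of the residual. That is a different scheme from solving the four block systems separately and copying the results together. Everything in the sketch is an assumption made for illustration: the function names, the dense storage (standing in for CSR to keep it short), and the small 2D Poisson test matrix. It also assumes a symmetric positive definite matrix for which ILU(0) yields positive pivots, so that the preconditioner is usable with CG.

```python
def poisson2d(m):
    """Dense 5-point Laplacian on an m x m grid (hypothetical test problem)."""
    n = m * m
    a = [[0.0] * n for _ in range(n)]
    for k in range(n):
        a[k][k] = 4.0
        if k >= m:
            a[k][k - m] = a[k - m][k] = -1.0   # vertical neighbour
        if k % m:
            a[k][k - 1] = a[k - 1][k] = -1.0   # horizontal neighbour
    return a

def ilu0(a):
    """ILU(0) of a dense-stored block: update only where a[i][j] is nonzero."""
    n = len(a)
    lu = [row[:] for row in a]
    for i in range(n):
        for k in range(i):
            if a[i][k] == 0.0:
                continue
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                if a[i][j] != 0.0:
                    lu[i][j] -= lu[i][k] * lu[k][j]
    return lu

def lu_solve(lu, b):
    """Solve L U x = b with unit-diagonal L and U stored together in lu."""
    n = len(b)
    y = b[:]
    for i in range(n):                          # forward substitution
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):                # backward substitution
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y

def block_jacobi_ilu0(a, nblocks):
    """Factor each diagonal block once; return a function computing z = M^-1 r."""
    n = len(a)
    bounds = [n * b // nblocks for b in range(nblocks + 1)]
    factors = []
    for lo, hi in zip(bounds, bounds[1:]):
        block = [row[lo:hi] for row in a[lo:hi]]   # off-block coupling is dropped
        factors.append((lo, hi, ilu0(block)))

    def apply(r):
        z = [0.0] * len(r)
        for lo, hi, lu in factors:              # independent: one block per core
            z[lo:hi] = lu_solve(lu, r[lo:hi])
        return z
    return apply

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(a, v):
    return [dot(row, v) for row in a]

def pcg(a, b, apply_m, tol=1e-8, maxit=500):
    """Preconditioned CG on the FULL matrix a; apply_m computes M^-1 r."""
    x = [0.0] * len(b)
    r = b[:]
    z = apply_m(r)
    p = z[:]
    rz = dot(r, z)
    bnorm = dot(b, b) ** 0.5
    for it in range(1, maxit + 1):
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if dot(r, r) ** 0.5 <= tol * bnorm:
            return x, it
        z = apply_m(r)                          # the only preconditioner step
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, maxit

a = poisson2d(8)
b = [1.0] * len(a)
x, iterations = pcg(a, b, block_jacobi_ilu0(a, 4))
```

In this formulation the parallel part is the loop inside `apply`: each core (or later each GPU thread block) handles one pair of triangular solves per iteration, while the matrix-vector product still uses the complete matrix, including the coupling between blocks.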

Thanks,
Jürgen
