AmgX preconditioner performance

I am currently experimenting with the linear solvers and preconditioners in AmgX. As a first test, I applied an unpreconditioned PCGF solver to a system with roughly 4*10^6 unknowns from a Poisson-like problem. The solver converges after a little under 3000 iterations, and using eight V100 GPUs, AmgX’s solver is about 20 times faster than the CPU-based PCG implementation that I compare against.

However, I have not yet been able to find a preconditioner that improves performance further. Even with a simple Jacobi preconditioner, the elapsed time for the solve increases by 70%, despite a slight reduction in the number of iterations.

When I apply a simple multigrid preconditioner (classical AMG, V-cycle, 3 Gauss-Seidel sweeps), the number of outer iterations is reduced by almost 85%, but the elapsed time for the solve increases more than 400-fold compared to the unpreconditioned solve (from 2.5 s to over 1000 s).

I understand that the Gauss-Seidel smoother, restriction, and interpolation may only allow for limited parallelism, but in my test a single iteration of the AMG-preconditioned PCGF solver takes about as long as solving the entire system without the preconditioner.
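As a rough sanity check of that last statement, the per-iteration cost can be estimated from the figures quoted above. This is only a back-of-the-envelope sketch: the inputs are the rounded values from this post ("a little under 3000" iterations, "almost 85%" reduction, "over 1000 s"), and the variable and function names are made up for illustration.

```python
# Approximate figures quoted in the post (rounded, not exact measurements).
baseline_iters = 3000       # "a little under 3000" unpreconditioned iterations
baseline_time_s = 2.5       # elapsed time of the unpreconditioned solve
amg_iter_reduction = 0.85   # "almost 85%" fewer outer iterations with AMG
amg_time_s = 1000.0         # "over 1000 s" with the AMG preconditioner


def per_iteration_cost(total_time_s, iterations):
    """Average wall-clock time per outer iteration, in seconds."""
    return total_time_s / iterations


# Estimated number of outer iterations with the AMG preconditioner.
amg_iters = round(baseline_iters * (1 - amg_iter_reduction))

baseline_cost = per_iteration_cost(baseline_time_s, baseline_iters)
amg_cost = per_iteration_cost(amg_time_s, amg_iters)

# How much more expensive one preconditioned iteration is on average.
cost_ratio = amg_cost / baseline_cost

print(f"AMG outer iterations (estimate): {amg_iters}")
print(f"Unpreconditioned cost per iteration: {baseline_cost * 1e3:.2f} ms")
print(f"AMG-preconditioned cost per iteration: {amg_cost:.2f} s")
print(f"Per-iteration slowdown: about {cost_ratio:.0f}x")
```

With these rounded inputs, one AMG-preconditioned iteration comes out at a bit over 2 s, which is close to the 2.5 s of the complete unpreconditioned solve. Note that this average also folds any AMG setup time into the per-iteration figure, if setup was included in the measured 1000 s.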

Does anyone have experience with the efficiency of GPU-based preconditioners, or can anyone think of potential problems in my configuration that would cause these performance issues?