How to get the best performance from PETSc running with CUSP?

Hi everyone,

I am working with the PETSc 3.7.6 library and see that the performance of PETSc+CUSP is worse than that of the PETSc reference implementation.

I configured PETSc with cusplibrary-0.4.0 and CUDA 5.5, and I am benchmarking the PETSc example that solves a Laplacian PDE using BiCGStab with a Jacobi preconditioner (ex39 in $PETSC_DIR/src/ksp/ksp/examples/tests).

I compare two runs.
First, with CUSP-enabled matrix operations on one Pascal GPU:
mpiexec.hydra -n ${i} ./ex39 -ksp_type fbcgs -ksp_rtol 1.e-8 -sub_ksp_type bcgs -sub_ksp_rtol 1.e-8 -pc_type jacobi -ksp_converged_reason -n1 128 -n2 512 -n3 512 -log_view -mat_type aijcusp
I vary the number of MPI processes sharing the GPU; the best result I see is with 10 MPI ranks: ~68 seconds.

Second, with the reference PETSc implementation, where the sparse BLAS runs on the host:
mpiexec.hydra -n ${i} ./ex39 -ksp_type fbcgs -ksp_rtol 1.e-8 -sub_ksp_type bcgs -sub_ksp_rtol 1.e-8 -pc_type jacobi -ksp_converged_reason -n1 128 -n2 512 -n3 512 -log_view -mat_type aij
This run performs best with 8 MPI ranks: ~48 seconds.

About 80% of the computational time is spent in the SpMV function. I expected a significant performance improvement when switching from the PETSc host loop to the GPU implementation of SpMV in CUSP. Am I running the executable incorrectly?

Thank you,