CUBLAS, keep results on device

I have written an implementation of the conjugate gradient (CG) iterative solver in CUDA. I use it only on one very special matrix, so I've written an optimized sparse matrix-vector product routine for that particular matrix. The rest of the solver uses the CUBLAS library exclusively.
CG performance is disappointing, and as it turns out, the matrix-vector product accounts for only about 13% of the total computation time, where 80-90% is typical. This suggests to me that the CUBLAS calls are the bottleneck, because their scalar results (from the dot and norm routines) are constantly copied back and forth between host and device.
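To make the issue concrete, here is a sketch of one CG iteration written against the legacy CUBLAS API (cublas.h); the names d_x, d_r, d_p, d_Ap and rr are placeholders of mine. Each cublasSdot returns its scalar straight to the host, even though the value is only consumed by the very next BLAS call:

```cuda
#include <cublas.h>  // legacy CUBLAS 1.x-3.x API

// Fragment of one CG iteration. d_x, d_r, d_p, d_Ap are device vectors;
// rr holds the current residual dot product (r . r) on the host.
float pAp   = cublasSdot(n, d_p, 1, d_Ap, 1);   // scalar comes back to host
float alpha = rr / pAp;                         // host-side arithmetic
cublasSaxpy(n,  alpha, d_p,  1, d_x, 1);        // x += alpha * p
cublasSaxpy(n, -alpha, d_Ap, 1, d_r, 1);        // r -= alpha * A*p
float rr_new = cublasSdot(n, d_r, 1, d_r, 1);   // another device->host copy
float beta   = rr_new / rr;                     // host-side arithmetic
cublasSscal(n, beta, d_p, 1);                   // p  = beta * p
cublasSaxpy(n, 1.0f, d_r, 1, d_p, 1);           // p += r
rr = rr_new;
```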
I want to keep these results in device memory, so they are directly available to subsequent kernels. I could take the CUBLAS 1.1 source code and modify it myself, but it is old, and I believe 3.0 performs better.
Alternatively, would you, NVIDIA, make the CUBLAS 3.0 (or later) source code available to registered developers?

You're right: the CUBLAS dot and norm functions are inefficient for this use case because they return their scalar results to host memory, which forces a device-to-host transfer (and a synchronization) on every call. Can you combine some of the CUBLAS calls into a single custom kernel? You could also have a look at the source code of the Thrust library, which has a reduce implementation you could probably extend to meet your needs. Either way, you won't have to wait for NVIDIA to release the CUBLAS source code.
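For example, here is a minimal sketch of a dot-product kernel that leaves its scalar result in device memory, so the next kernel can consume it without a host round trip. The names (dot_kernel, d_result) are illustrative, and the atomicAdd on float assumes a compute capability 2.0 (Fermi) or later GPU; on older hardware you would instead write one partial sum per block and reduce those in a second kernel.

```cuda
#include <cuda_runtime.h>

// Dot product whose scalar result stays in device memory. Zero *result
// (e.g. with cudaMemset) before launching.
__global__ void dot_kernel(const float *x, const float *y,
                           float *result, int n)
{
    extern __shared__ float cache[];

    // Grid-stride loop: each thread accumulates a private partial sum.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += x[i] * y[i];
    cache[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic add per block; the result never leaves the device.
    if (threadIdx.x == 0)
        atomicAdd(result, cache[0]);
}
```

A typical launch would be cudaMemset(d_result, 0, sizeof(float)) followed by dot_kernel<<<64, 256, 256 * sizeof(float)>>>(d_x, d_y, d_result, n); the shared-memory size must match the block size.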