I have just finished a BiCGSTAB solver on the GPU using the CSR matrix format. I implemented my own vectorized version of SpMV (not heavily optimized) alongside the scalar version, and the code switches between the two based on the mean number of non-zeros per row. For the vector dot product and 2-norm, I use CUBLAS.
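To make the setup concrete, here is a minimal host-side sketch of the CSR layout, a reference SpMV, and the kernel-selection heuristic described above. It is not the actual GPU code: the names `CsrMatrix`, `spmv_csr`, `mean_nnz_per_row`, and `choose_kernel` are hypothetical, and the default threshold of 32 non-zeros per row is only an assumed placeholder (chosen to match the warp size), not a tuned value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CSR storage: row_ptr has n_rows + 1 entries; col_idx and values
// each have nnz = row_ptr[n_rows] entries.
struct CsrMatrix {
    std::size_t n_rows;
    std::vector<int> row_ptr;
    std::vector<int> col_idx;
    std::vector<double> values;
};

// Reference y = A * x on the CPU. The loop body is the work one thread
// would do per row in a "scalar" GPU kernel.
std::vector<double> spmv_csr(const CsrMatrix& A, const std::vector<double>& x) {
    assert(A.row_ptr.size() == A.n_rows + 1);
    std::vector<double> y(A.n_rows, 0.0);
    for (std::size_t i = 0; i < A.n_rows; ++i) {
        double sum = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            sum += A.values[k] * x[A.col_idx[k]];
        }
        y[i] = sum;
    }
    return y;
}

// Mean number of stored non-zeros per row.
double mean_nnz_per_row(const CsrMatrix& A) {
    if (A.n_rows == 0) return 0.0;
    return static_cast<double>(A.row_ptr[A.n_rows]) / static_cast<double>(A.n_rows);
}

enum class SpmvKernel { Scalar, Vector };

// Pick a kernel from the mean row length.
// ASSUMPTION: the threshold of 32 is illustrative only and should be
// determined by benchmarking on the target GPU.
SpmvKernel choose_kernel(const CsrMatrix& A, double threshold = 32.0) {
    return mean_nnz_per_row(A) >= threshold ? SpmvKernel::Vector
                                            : SpmvKernel::Scalar;
}
```

With an average of 88 non-zeros per row, a heuristic of this shape would select the vector kernel for any threshold at or below 88.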
Could somebody tell me how much speedup I should expect from my code? I am getting almost a 6x speedup on a system with an average of 88 non-zeros per row. My setup is a GTX 260, Ubuntu 8.10 64-bit, CUDA 2.1 with the driver that ships with it, a Core 2 Quad 9650 CPU, and 4 GB of RAM. The matrix dimension is around 350,000. I am comparing against the same code running on the CPU, without OpenMP or similar optimizations.
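For context on the problem size, the sketch below estimates how many bytes the CSR arrays occupy, which is roughly what each SpMV has to stream from memory. SpMV is usually limited by memory bandwidth rather than arithmetic, so this figure matters when judging a speedup. The function name `csr_bytes` is hypothetical, the post does not state the precision, and 4-byte indices are an assumption, so both single and double precision are shown in the comments.

```cpp
#include <cstdint>

// Approximate storage for a CSR matrix:
//   values:  nnz * value_bytes
//   col_idx: nnz * index_bytes
//   row_ptr: (n_rows + 1) * index_bytes
// ASSUMPTION: 4-byte indices; value_bytes is 4 (float) or 8 (double).
std::uint64_t csr_bytes(std::uint64_t n_rows, std::uint64_t nnz,
                        std::uint64_t value_bytes, std::uint64_t index_bytes) {
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes;
}

// For about 350,000 rows with 88 non-zeros per row (30,800,000 non-zeros):
//   csr_bytes(350000, 30800000, 4, 4) == 247800004  (about 248 MB, float)
//   csr_bytes(350000, 30800000, 8, 4) == 371000004  (about 371 MB, double)
```

BiCGSTAB performs two SpMVs per iteration, so under these assumptions each iteration reads the matrix arrays twice, in addition to the vector traffic.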
Another question: how much do you think I would benefit from the zero-copy feature in CUDA 2.2? I have not yet read the manual or experimented with it.