Efficient Dot Product Help and Another Question

I’ve been working on my CUDA/FEM code for a while and am nearing the completion of my project.
I wrote my own sparse matrix-vector multiplication kernel and it runs pretty well. It gets a speedup of roughly 300x for larger matrices in a test case.
The problem I’m having right now is that the dot product code included in the CUDA by Example book seems horrendously inefficient.
It takes something like 120 times longer to run the dot product code vs my matrix-vector code.
I know cuBLAS has a built-in dot product routine, but I hear it is fairly slow.
Are there any time efficient functions/algorithms out there?

Also, what sort of real-world speedup should be expected? Right now I’m getting ~7x for a hair under 3 million non-zeros in my sparse matrix. That timing includes the matrix setup time, which is pretty high.

Thanks.

Side note: I have access to a 320M, a GTX 275/295 and, I think, a Quadro 5000, in case it matters.

Please refer to these links:
http://www.nvidia.com/object/nvidia_research_pub_001.html
http://www.nvidia.com/object/nvidia_research_pub_013.html

I’ve already read both papers before and they both helped.

My issue isn’t with the sparse matrix aspect but with the dot product aspect of my code.

Since the dot product is entirely bandwidth limited, optimizing memory access is key. If the dot product involves vectors you have previously computed, just do it on the fly while writing those vectors (or even don’t write them out at all if they aren’t needed elsewhere).
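To make the "do it on the fly" idea concrete, here is a rough sketch that fuses a vector update with the dot product of the result. It assumes the update has the form y = y + alpha * x and that the value needed afterwards is y·y (as in many iterative solvers); the thread does not say what the actual update is. The names axpy_dot_partial, axpy_then_dot and kBlock, and the block count of 64, are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 256;  // threads per block; must be a power of two

// Updates y[i] = y[i] + alpha * x[i] and, in the same pass, accumulates
// y[i] * y[i] so the updated vector is never re-read from device memory.
// Each block writes one partial sum to partial[blockIdx.x].
__global__ void axpy_dot_partial(float alpha, const float* x, float* y,
                                 float* partial, int n) {
    __shared__ float cache[kBlock];
    float sum = 0.0f;
    // Grid-stride loop: neighbouring threads touch neighbouring elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        float v = y[i] + alpha * x[i];
        y[i] = v;        // write the updated vector once
        sum += v * v;    // reuse the value while it is still in a register
    }
    cache[threadIdx.x] = sum;
    __syncthreads();
    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: launches the fused kernel and adds up the
// per-block partial sums on the CPU. d_x and d_y are device pointers.
float axpy_then_dot(float alpha, const float* d_x, float* d_y, int n) {
    const int blocks = 64;  // assumed value; tune for the actual GPU
    float* d_partial = nullptr;
    cudaMalloc(&d_partial, blocks * sizeof(float));
    axpy_dot_partial<<<blocks, kBlock>>>(alpha, d_x, d_y, d_partial, n);
    std::vector<float> h_partial(blocks);
    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_partial);
    double total = 0.0;
    for (float p : h_partial) total += p;
    return static_cast<float>(total);
}
```

Compared with a separate update kernel followed by a separate dot product kernel, this saves one full read of y from device memory, which is where the time goes in a bandwidth-limited operation.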

Once I start my while loop for the iterative process, nothing leaves the graphics card except the partial results of the dot product and the final result of the whole process.

This link may help:
http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study-Simple-Reductions.aspx

It’s OpenCL oriented and computes the maximum of a vector, but it’s quite easy to port it to CUDA and compute the scalar product instead. Just keep in mind that OpenCL maps closely onto CUDA: you only need to replace get_local_id( 0 ) with threadIdx.x, get_group_id( 0 ) with blockIdx.x, get_global_id( 0 ) with blockIdx.x * blockDim.x + threadIdx.x, barrier(CLK_LOCAL_MEM_FENCE) with __syncthreads(), etc.
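For illustration, a reduction of that general kind adapted to a dot product in CUDA might look roughly like the sketch below. This is not the code from the linked article; the names dot_partial, dot_on_gpu and kThreads, and the block count of 64, are assumptions chosen for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per block; must be a power of two

// Each block accumulates a[i] * b[i] over a strided slice of the vectors,
// reduces its per-thread sums in shared memory, and writes one partial
// result to partial[blockIdx.x].
__global__ void dot_partial(const float* a, const float* b,
                            float* partial, int n) {
    __shared__ float cache[kThreads];
    float sum = 0.0f;
    // Grid-stride loop: consecutive threads read consecutive elements,
    // so global memory loads are coalesced.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        sum += a[i] * b[i];
    }
    cache[threadIdx.x] = sum;
    __syncthreads();
    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: d_a and d_b are device pointers of length n.
// Only the per-block partial sums are copied back to the host.
float dot_on_gpu(const float* d_a, const float* d_b, int n) {
    const int blocks = 64;  // assumed value; tune for the actual GPU
    float* d_partial = nullptr;
    cudaMalloc(&d_partial, blocks * sizeof(float));
    dot_partial<<<blocks, kThreads>>>(d_a, d_b, d_partial, n);
    std::vector<float> h_partial(blocks);
    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_partial);
    double total = 0.0;
    for (float p : h_partial) total += p;
    return static_cast<float>(total);
}
```

With a fixed, small number of blocks each thread sums many elements before the shared-memory step, so only 64 floats cross to the host per dot product.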

Regards,
enuhtac

I was talking about memory bandwidth though, not PCIe bandwidth. So try to minimize access to (off-chip) device memory as well as transfers to host memory.