I’ve been working on my CUDA/FEM code for a while and am nearing the completion of my project.

I wrote my own sparse matrix-vector multiplication kernel and it runs pretty well — something like a ~300x speedup over my CPU baseline on a test case with larger matrices.

The problem I’m having right now is that the dot-product code included in the CUDA by Example book seems horrendously inefficient.

The dot-product code takes something like 120 times longer to run than my matrix-vector kernel.
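For comparison, here is a minimal sketch of the usual shared-memory dot product, written with a grid-stride loop so a small, fixed number of blocks covers any vector length (the book's version launches one thread per element, which is often the source of the overhead). This is my own sketch, not the book's exact code; the block size of 256 and the kernel/parameter names are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Sketch: dot product with a grid-stride loop and a shared-memory
// tree reduction. Each block writes one partial sum to `partial`;
// the (small) array of partials is summed afterwards on the host.
// Launch with blockDim.x == 256 to match the cache size below.
__global__ void dotKernel(const float *a, const float *b,
                          float *partial, int n) {
    __shared__ float cache[256];
    float sum = 0.0f;
    // Grid-stride loop: works for any n with a fixed grid size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += blockDim.x * gridDim.x)
        sum += a[i] * b[i];
    cache[threadIdx.x] = sum;
    __syncthreads();
    // Tree reduction within the block (blockDim.x must be a power of 2).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}
```

One thing worth checking before blaming the kernel itself: if your timing includes the cudaMemcpy of the inputs (as the book's timing harness does), that transfer will dominate, since a dot product does almost no arithmetic per byte moved.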

I know cuBLAS has a built-in dot product routine, but I hear it is fairly slow.
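In my experience cuBLAS is worth benchmarking rather than ruling out; for large vectors its dot routine is generally close to bandwidth-bound, and the fixed cost is mostly in handle creation. A minimal sketch, assuming the v2 handle API and that both vectors already live in device memory (the wrapper name `gpuDot` is my own):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: single-precision dot product of two device vectors via
// cuBLAS. d_x and d_y are assumed to be device pointers of length n.
float gpuDot(const float *d_x, const float *d_y, int n) {
    cublasHandle_t handle;
    float result = 0.0f;
    cublasCreate(&handle);
    // By default the scalar result is written back to host memory.
    cublasSdot(handle, n, d_x, 1, d_y, 1, &result);
    cublasDestroy(handle);
    return result;
}
```

If you call this inside an iterative solver loop, create the handle once outside the loop and reuse it; creating and destroying it per call adds significant overhead.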

Are there any time efficient functions/algorithms out there?

Also, what sort of real-world speedup should I expect? Right now I’m getting ~7x for just under 3 million non-zeros in my sparse matrix; that timing includes matrix setup time, which is pretty high.

Thanks.

Side note: I have access to a 320M, a GTX 275/295, and I think a Quadro 5000, in case it matters.