I have recently compared the performance of my CPU (Intel Q6600, 2.4 GHz) against a GeForce GTX 560 Ti. I use a 64-bit Debian distro and the Intel Fortran compiler. To perform the calculations on the GPU I use the fortran.c wrapper (without thunking) and the CUBLAS library. The performance results (1000 identical dot products of two vectors) as a function of vector length are presented in the attached graphs. The attachment names indicate whether the results were obtained in Single (SP) or Double Precision (DP).
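In C terms, the timed part of my benchmark is roughly equivalent to the following (a minimal sketch using the legacy CUBLAS API that fortran.c wraps; the vector length and fill values are placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cublas.h>

int main(void)
{
    int i, n = 100000;                 /* vector length; varied across the tests */
    float *h_x, *d_x, *d_y, result = 0.0f;
    struct timeval t0, t1;
    double us;

    cublasInit();

    /* host data: the contents don't matter for the timing */
    h_x = (float*)malloc(n * sizeof(float));
    for (i = 0; i < n; i++) h_x[i] = 1.0f;

    /* allocation and host-to-device copies stay OUTSIDE the timed region */
    cublasAlloc(n, sizeof(float), (void**)&d_x);
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), h_x, 1, d_y, 1);

    gettimeofday(&t0, NULL);
    for (i = 0; i < 1000; i++)
        result = cublasSdot(n, d_x, 1, d_y, 1);  /* blocks until the scalar is back on the host */
    gettimeofday(&t1, NULL);

    us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("result = %f, avg = %.3f ms per dot product\n", result, us / 1000.0 / 1000.0);

    cublasFree(d_x);
    cublasFree(d_y);
    free(h_x);
    cublasShutdown();
    return 0;
}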
As one can see, for vector lengths below a few hundred thousand elements the GPU performance is much lower than the CPU's (and although my CPU has 4 cores, I used only one core in these tests!). There seems to be a fixed penalty of around 0.14 ms per iteration. Is that normal, or could it have something to do with using the fortran.c wrapper? Could any of you post similar results for comparison? I want to note that I measured only the time consumed by the 1000 dot products themselves, excluding memory allocation and the copying of the vectors.
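For what it's worth, one could isolate the per-call overhead with CUDA events instead of wall-clock timing (a sketch; time_one_sdot is a made-up helper, and the same legacy API as above is assumed):

#include <cublas.h>
#include <cuda_runtime.h>

/* Times a single cublasSdot call of length n with CUDA events (returns ms).
   If the value for a small n (say 1,000) is close to the value for a large n
   (say 1,000,000), the cost is dominated by a fixed per-call launch/readback
   penalty rather than by the arithmetic itself. */
float time_one_sdot(int n, const float *d_x, const float *d_y)
{
    cudaEvent_t start, stop;
    float ms;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    (void)cublasSdot(n, d_x, 1, d_y, 1);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}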
Why is this important to me? I need to perform several hundred thousand dot products, but relatively small ones (vectors several thousand elements long). From what I can see, using the GPU for this particular problem is pointless. Of course, there is a possibility that I'm doing something wrong - I hope you can show me how I could improve the performance.
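One thing I could try, if the per-call overhead cannot be reduced, is to batch the small dot products into a single CUBLAS call. A minimal sketch, assuming the dot products share one common vector x (which may not hold for my data; batched_dots, d_V, d_x, d_y are hypothetical names):

#include <cublas.h>

/* Store the k other vectors as the columns of an n-by-k matrix V
   (column-major, as CUBLAS expects). Then y = V^T * x yields all k
   dot products with a single kernel launch instead of k launches. */
void batched_dots(int n, int k, const float *d_V, const float *d_x, float *d_y)
{
    cublasSgemv('T', n, k, 1.0f, d_V, n, d_x, 1, 0.0f, d_y, 1);
}

If the vector pairs are all distinct, the same idea applies in spirit: an element-wise multiply followed by row-wise reductions (or a small custom kernel) would replace hundreds of thousands of launches with a handful.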
I also have another question, which I should probably ask elsewhere, but I lose nothing by asking here. Is it normal that the NVIDIA X Server Settings utility tells me I have a PCIe Gen 1 bus? Both the chipset on the motherboard (X38) and the graphics card support PCIe Gen 2… (I had hoped for more than 2.6 GB/s transfers from host to card memory…).