CUBLAS ddot performance - how can I improve it?

Hello everyone.

I have recently compared the performance of my CPU (Intel Q6600 2.4GHz) vs a GeForce GTX 560 Ti. I use a 64-bit Debian distro and the Intel Fortran compiler. To perform calculations on the GPU I use the fortran.c wrapper (without thunking) and the CUBLAS library. The performance results (1000 identical dot products performed on two vectors) as a function of vector length are presented in the graphs in the attachment. The names of the attachments show whether the results were obtained in Single (SP) or Double Precision (DP).

As one can see, for vector lengths below hundreds of thousands the performance of the GPU is much lower than that of the CPU (and although my CPU has 4 cores, I used only one core in these tests!). There seems to be some sort of penalty (around 0.14ms per iteration). Is that normal, or could it have something to do with using the fortran.c wrapper? Could any of you post similar results for comparison? I want to note that I counted only the time consumed by the 1000 dot products themselves - without memory allocation and copying of the vectors.

Why is this important to me? I need to perform several hundred thousand dot products, but relatively small ones (vectors several thousand elements long). From what I can see, using the GPU for this particular problem is pointless. Of course there is a possibility that I’m doing something wrong - I hope you can show me how I could improve the performance.

I also have another question which I should probably ask somewhere else, but I lose nothing by asking here. Is it normal that the NVIDIA X Server Settings utility tells me I have a PCIe Gen 1 bus? Both the chipset on the motherboard (X38) and the graphics card support PCIe Gen 2… (I had hoped for more than 2.6 GB/s transfers from host to card memory…).

ddot has the absolutely worst computational density possible: one memory access per arithmetic operation. Its cost lies entirely in moving the data around, so sum the products right where the data is generated, without even going through memory. If the data happens to be generated by the CPU, do the sum there - no need to go through host memory, the PCIe bus and device memory (although device memory could be cut out of the loop with mapped memory, that’s the fastest part of the chain anyway).

I assume that when preparing the graphs, you generated the data on the CPU. Even though you didn’t count the time for host->device and device->host copies, this still gives the CPU a big advantage, because in the CPU version the data never even hits main memory - it just stays in the CPU cache. Only when the vector size exceeds the CPU cache size do you start seeing the ~10x larger device memory bandwidth.

Regarding the PCIe bus speed, you can use pinned memory to get transfers running at roughly twice the speed of pageable (unpinned) memory.
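A minimal CUDA-C sketch of the idea (sizes here are arbitrary): allocating the host buffer with cudaMallocHost instead of malloc gives page-locked memory, so cudaMemcpy can DMA it directly:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1 << 20;
    double *h_x, *d_x;

    /* cudaMallocHost returns page-locked (pinned) host memory; transfers
       from it can use DMA and typically run much faster than from
       ordinary malloc'd (pageable) memory. */
    cudaMallocHost((void **)&h_x, n * sizeof(double));
    cudaMalloc((void **)&d_x, n * sizeof(double));

    for (size_t i = 0; i < n; ++i)
        h_x[i] = (double)i;

    cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);

    cudaFree(d_x);
    cudaFreeHost(h_x);   /* pinned memory needs cudaFreeHost, not free() */
    return 0;
}
```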

As Tera points out, GPU acceleration of Level 1 BLAS functions is basically a memory-bandwidth-limited problem, and as your results show, the problem size needs to be large enough to fully utilize the available GPU bandwidth before you will see any speed-up over the CPU. Even then, the upper bound on the speed-up is the ratio of host to GPU memory bandwidth, which is usually in the 5-10x range. Add in the overhead costs of CPU-GPU transfers and most of that 5-10x gets eroded. This is why most people aim to express their linear algebra operations using Level 2 or Level 3 BLAS functions: their FLOP:memory transaction ratios are much more advantageous than those of Level 1 operations. Seeing as you have many dot products to perform, is there any possibility of batching them and using gemv() or gemm() instead?

Thank you very much for your replies. Indeed, I can use gemm instead of the dot products. I haven’t run detailed tests, but from what I have seen, the time of the most demanding computations in my program decreased by much more than 10-fold. That’s quite an improvement! (I didn’t expect such a change at all.)

I will also take a look at pinned memory, though right now I’m probably not limited by data transfers (I transfer only a few MB of data).

Again - thank you very much for your help!