I'm trying to port our machine learning framework to CUDA. The operation we perform most often is SGEMV, i.e. single-precision matrix-vector multiplication. We currently use CBLAS (Intel MKL) and get decent speed.
I compared the CBLAS performance to the same operation using cublasSgemv() and found that cuBLAS is actually slower. I have an Intel Core i7 920 CPU (4 cores at 2.66 GHz, with Hyper-Threading and Turbo Boost) and a GTX 285 GPU (actually two of them, but I only use one for these tests). The matrices/vectors we use are quite small, e.g. 500x100.
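For reference, this is roughly the call I'm timing on the GPU side (a minimal sketch using the cuBLAS v2 API; the helper name and the explicit synchronization are mine, error checking omitted):

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Sketch of the benchmarked operation: y = A * x with A already resident
 * on the GPU. A is m x n in column-major order, as cuBLAS expects. */
void gemv_once(cublasHandle_t handle,
               const float *d_A,  /* m x n matrix, on the GPU */
               const float *d_x,  /* n-vector, on the GPU */
               float *d_y,        /* m-vector, on the GPU */
               int m, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    /* y = alpha * A * x + beta * y */
    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, d_A, m,   /* lda = m for a dense column-major A */
                d_x, 1,
                &beta, d_y, 1);
    cudaDeviceSynchronize();      /* so the timer measures the kernel itself */
}
```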
It seems that cuBLAS is about 2.5 times slower than CBLAS for this kind of operation. However, as I increase the matrix size, cuBLAS performs better and better relative to CBLAS.
My questions are the following: is this normal? Are my matrices too small for the GPU's parallelism to overcome the CPU's higher clock speed? Is there a way to benefit from the GPU for small matrices like these (other than the obvious approach of multiplying several at the same time, sketched at the end of this post)?
Note that in the CUDA case, all data already resides on the GPU, i.e. the timings include no CPU-GPU transfers at all.
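To illustrate the "multiply several at the same time" workaround I mentioned: if the same A is applied to k different input vectors, they can be stacked as the columns of an n x k matrix and handled with a single cublasSgemm call instead of k gemv calls. This is just a sketch of the idea (hypothetical helper, not what our framework currently does):

```c
#include <cublas_v2.h>

/* Sketch: apply one m x n matrix A to k input vectors at once.
 * X packs the k inputs as columns (n x k); Y receives the k results
 * as columns (m x k). All buffers column-major and on the GPU. */
void gemv_batched_as_gemm(cublasHandle_t handle,
                          const float *d_A, /* m x n matrix */
                          const float *d_X, /* n x k, one input per column */
                          float *d_Y,       /* m x k, one result per column */
                          int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    /* Y = alpha * A * X + beta * Y */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, k, n,
                &alpha, d_A, m,
                d_X, n,
                &beta, d_Y, m);
}
```

The point of the question is whether there is something better than this kind of batching for genuinely small, one-at-a-time SGEMV calls.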