CUBLAS sgemv slower than CBLAS for small matrix sizes


I’m trying to port our machine learning framework to CUDA. The operation we do most is sgemv, i.e. matrix-vector multiplication. We currently use CBLAS (Intel MKL) and get decent speed.

I compared the CBLAS performance with the same operation using cublasSgemv() and found that CUBLAS is actually slower. I have an Intel Core i7 920 CPU (4x2.6 GHz with HT and Turbo Boost) and a GTX 285 GPU (actually two of them, but I use only one for my tests). The dimensions of the matrices/vectors we use are quite small, e.g. 500x100.

It seems that CUBLAS is 2.5 times slower than CBLAS for this kind of operation. However, as I increase the matrix size, CUBLAS becomes progressively faster relative to CBLAS.

My questions are the following: is this normal? Are my matrices too small for the GPU's parallelism to overcome the CPU's higher clock frequency? Is there a way to benefit from the GPU when operating on small matrices like these (apart from the obvious approach of multiplying several at the same time)?

Note that in the CUDA case, all data is already on the GPU, i.e. there is no CPU-GPU transfer at all.

Yes, this is normal: CUBLAS is optimised for very large matrix sizes. For smaller sizes I’ve written my own kernel, and even without much optimisation it’s faster than the CPU running the same thing. Probably not quite as fast as CBLAS, though; that said, my CPU is now free for other work.

A 500x100 matrix-vector multiply only involves 50,000 multiply-adds (about 100,000 flops), which is trivial: probably a couple of orders of magnitude less work than most linear algebra codes are designed to be fast at. At that size you probably could engineer something that would be fast on a GT200, but barely.