Matrix Multiplications, Latency, CUDA vs OpenMP

Hi everyone,

If we want to do a basic matrix-matrix multiplication using CUDA (e.g. C = A*B), we need to copy the data stored in A and B from host to device memory (latency). Furthermore, we would need to use tiling techniques via shared memory to limit the number of global memory accesses. But given the unavoidable latency of transferring data from host to device memory, is using CUDA really better than, say, using OpenMP? If so, how can we know a priori that it is better to use CUDA (rather than OpenMP with N cores) without actually writing the CUDA code? At what matrix dimension (for a given number of available OpenMP cores) is it worth using CUDA over OpenMP?
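(Not an answer to the performance question, but for anyone unfamiliar with the tiling idea mentioned above, here is a rough sketch of the blocking scheme in plain Python/NumPy. The function name and tile size are just illustrative; a real CUDA kernel would stage each tile in shared memory rather than rely on slicing.)

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """Blocked matrix multiply: each (tile x tile) block of C is
    accumulated from a loop over tiles of A and B. This mirrors how
    a CUDA kernel would load tiles into shared memory once and reuse
    them across a thread block, cutting global-memory accesses."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    for i in range(0, n, tile):
        for j in range(0, p, tile):
            for k in range(0, m, tile):
                # In CUDA, these two sub-blocks would sit in shared
                # memory, loaded cooperatively by the thread block.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

Each element of the A and B tiles is reused `tile` times per load, which is exactly the global-memory saving the shared-memory version buys you.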

I presume you are asking about dense matrices? If so, the simple way to test is to use CUBLAS, the NVIDIA BLAS, which is pretty optimal for matrix-matrix multiplication in both single and double precision.

On the fastest cards, CUBLAS sgemm() can hit roughly 300 GFlop/s and dgemm() about 80 GFlop/s. The PCI-e transfer overheads mean that it only makes sense for large, “closer to square than column or row” matrices, so a smart implementation would use a size/shape metric to determine when you should do the multiply on the GPU and when you shouldn’t. An even smarter approach is to overlap calculations, using both host and GPU in parallel with an optimal split based on a metric which accounts for the relative performance of the GPU and CPU and the PCI-e latency, so that some columns of the product are calculated on the GPU and some on the host. Using that sort of approach I can get 110 GFlop/s in dgemm() on the fastest card/host combination I have access to.

Edit: this is a figure I posted a while ago showing the relative performance of the GPU, the CPU, and a roughly optimal hybrid scheme.
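(To make the split idea concrete, here is a toy model of choosing the column fraction. The function name, the assumption that work splits linearly by columns, and the single lumped transfer time are all simplifications of mine, not the actual scheme above.)

```python
def gpu_fraction(flops, gpu_gflops, cpu_gflops, transfer_s):
    """Fraction f of the work to offload so that the GPU (including
    its PCI-e transfer time) and the CPU finish at the same moment:

        transfer_s + f*flops/G = (1 - f)*flops/C

    Solving for f and clamping to [0, 1]. Rates are in GFlop/s,
    transfer time in seconds."""
    G = gpu_gflops * 1e9
    C = cpu_gflops * 1e9
    f = (flops / C - transfer_s) / (flops / G + flops / C)
    return min(max(f, 0.0), 1.0)
```

With zero transfer cost and equal throughputs the model sends exactly half the columns to the GPU; as the transfer cost grows relative to the total work, the optimal GPU share shrinks toward zero, which is the shape/size cutoff behaviour described above.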

Great post. Can you elaborate on why it would make sense for the original matrix to be closer to square? For instance, if we have C = A*B with A being (nxm) and B being (mxn) with n>>m, and m is still sufficiently large (e.g. n = 4096 and m = 64), can’t we still get a good speedup using a GPU?

It comes down to operation count versus memory and PCI-e latency. Imagine two bounding cases:

  1. (mx1)(1xm), which requires 2m^2 FLOPs, m^2 words of memory for the output matrix, and an O(m^2) transaction on the PCI-e bus to copy the result back to host memory
  2. (mxm)(mxm), which requires 2m^3 FLOPs, but also only m^2 words of memory for the output matrix and an O(m^2) PCI-e transfer.

The first is effectively gemv, the second gemm. The second one (and cases which approach it) is much more likely, at any given value of m, to be worth offloading to the GPU than the first, because its ratio of FLOPs to memory requirements and PCI-e overhead is much more favourable.
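The arithmetic behind those two bounding cases can be checked in a couple of lines (the helper name is mine; it only counts the result traffic, ignoring the input transfers, as in the cases above):

```python
def flops_per_word(n, m, k):
    """FLOPs per word of PCI-e result traffic for an (n x m)(m x k)
    product: FLOPs = 2*n*m*k, result words = n*k (the C matrix
    copied back to the host)."""
    return (2 * n * m * k) / (n * k)

# Outer-product case (m x 1)(1 x m): always 2 FLOPs per word moved.
# Square case (m x m)(m x m): 2m FLOPs per word, growing with m.
```

So the square product does m times more arithmetic per word it has to ship back over the bus, which is why it amortises the PCI-e cost and the outer-product case never does.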

What about sparse matrices? Is it worth using CUDA for sparse matrix multiplications, even when the matrix is square? I suspect the answer is no, since the sparse matrix can be re-organized into unbalanced blocks that resemble vectors, which makes the operation count per memory access small.

There are CUDA accelerated sparse matrix-vector routines which give good speed up over similar CPU implementations at sensible sizes. See here for more details.