Matrix Multiplications, Latency, CUDA vs OpenMP

Hi everyone,

If we want to do a basic matrix-matrix multiplication using CUDA (e.g. C = A*B), we need to copy the data stored in A and B from host to device memory (latency), and we also need to use tiling via shared memory to limit the number of global memory accesses. But given the unavoidable latency of transferring data from host to device memory, is using CUDA really better than, say, using OpenMP? If so, how can we know a priori that it is better to use CUDA (rather than OpenMP with N cores) without actually writing the CUDA code? At what matrix dimension (for a given number of OpenMP cores) is it worth using CUDA over OpenMP?
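For concreteness, this is the sort of kernel I have in mind; just a minimal sketch, assuming square row-major matrices with N a multiple of TILE, and with no error checking:

    #include <cuda_runtime.h>
    #include <cstdlib>

    #define TILE 16

    // C = A*B for square N x N row-major matrices, N assumed to be a multiple of TILE.
    // Each block stages TILE x TILE tiles of A and B in shared memory to cut global memory traffic.
    __global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }

        C[row * N + col] = acc;
    }

    int main()
    {
        const int N = 1024;
        size_t bytes = (size_t)N * N * sizeof(float);

        float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
        for (int i = 0; i < N * N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

        float *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);

        // These copies are the host-to-device latency the question is about.
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
        matmul_tiled<<<grid, block>>>(dA, dB, dC, N);

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }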

I presume you are asking about dense matrices? If so, the simple way to test is to use CUBLAS, the NVIDIA BLAS, which is close to optimal for matrix-matrix multiplication in both single and double precision.
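Something along these lines is enough to benchmark it on your own hardware; this is only a sketch using the original cublas.h interface, with error checking omitted and the matrix size as a placeholder:

    #include <cublas.h>
    #include <cstdlib>

    // Sketch: C = A*B for n x n single precision matrices via CUBLAS sgemm.
    // BLAS expects column-major storage; error checking is omitted for brevity.
    int main()
    {
        const int n = 2048;
        size_t bytes = (size_t)n * n * sizeof(float);

        float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
        for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

        cublasInit();

        float *dA, *dB, *dC;
        cublasAlloc(n * n, sizeof(float), (void **)&dA);
        cublasAlloc(n * n, sizeof(float), (void **)&dB);
        cublasAlloc(n * n, sizeof(float), (void **)&dC);

        // The PCI-e transfer cost is paid here...
        cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);
        cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);

        // C = 1.0 * A * B + 0.0 * C
        cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

        // ...and here, copying the result back to the host.
        cublasGetMatrix(n, n, sizeof(float), dC, n, hC, n);

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
        free(hA); free(hB); free(hC);
        return 0;
    }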

On the fastest cards, CUBLAS sgemm() can hit roughly 300 GFlop/s and dgemm() about 80 GFlop/s. The PCI-e transfer overheads mean it only makes sense for large matrices that are closer to square than to long rows or columns, so a smart implementation would use a size/shape metric to decide when the multiply should be done on the GPU and when it shouldn't. An even smarter approach is to overlap the work, running the host and GPU in parallel with the split chosen by a metric that accounts for the relative performance of the GPU and CPU and the PCI-e latency, so that some columns of the product are computed on the GPU and some on the host. Using that sort of approach I can get 110 GFlop/s in dgemm() on the fastest card/host combination I have access to.
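A crude version of that kind of metric, just to give the idea; all three rates are placeholders you would calibrate on your own hardware, and it ignores the possibility of overlapping transfer and compute:

    #include <stdio.h>

    // Offload C = A*B (A is m x k, B is k x n) to the GPU only if the estimated
    // GPU time, including PCI-e transfers, beats the estimated host time.
    // All three rates are placeholders - measure them on your own system.
    static const double GPU_GFLOPS = 300.0;  // e.g. CUBLAS sgemm on a fast card
    static const double CPU_GFLOPS = 40.0;   // e.g. a multithreaded host sgemm
    static const double PCIE_GBS   = 5.0;    // sustained PCI-e bandwidth in GB/s

    int worth_offloading(long m, long n, long k, int elem_size)
    {
        double flops = 2.0 * m * n * k;
        double bytes = ((double)m * k + (double)k * n + (double)m * n) * elem_size; // A, B down, C back
        double t_gpu = flops / (GPU_GFLOPS * 1e9) + bytes / (PCIE_GBS * 1e9);
        double t_cpu = flops / (CPU_GFLOPS * 1e9);
        return t_gpu < t_cpu;
    }

    int main()
    {
        printf("(4096 x 64)(64 x 4096):     %s\n", worth_offloading(4096, 4096, 64, 4) ? "GPU" : "CPU");
        printf("(4096 x 4096)(4096 x 4096): %s\n", worth_offloading(4096, 4096, 4096, 4) ? "GPU" : "CPU");
        return 0;
    }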

Edit: this is a figure I posted a while ago showing the relative performance of the GPU, CPU and a roughly optimal hybrid scheme.

Great post. Can you elaborate on why it makes sense for the matrices to be closer to square? For instance, if we have C = A*B with A being (n x m) and B being (m x n) with n >> m, and m is still sufficiently large (e.g. n = 4096 and m = 64), can't we still get a good speedup using a GPU?

It comes down to operation count versus memory and PCI-e latency. Imagine two bounding cases:

  1. (m x 1)(1 x m), which requires 2m^2 FLOPs, m^2 words of memory for the output matrix, and an O(m^2) transaction on the PCI-e bus to copy the result back to host memory
  2. (m x m)(m x m), which requires 2m^3 FLOPs, but still only m^2 words of memory for the output matrix and an O(m^2) PCI-e transfer.

The first is effectively a level 2 BLAS operation (an outer product), the second is gemm. The second one (and cases which approach it) is far more likely, at any given value of m, to be worth offloading to the GPU than the first, because the ratio of FLOPs to memory requirements and PCI-e overhead is much more favourable: roughly 2 FLOPs per word transferred in the first case versus O(m) FLOPs per word in the second.
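To put rough numbers on that ratio, counting the downloads of A and B as well as the upload of C:

    #include <stdio.h>

    // FLOPs per word moved over PCI-e for C = A*B with A (n x k) and B (k x m),
    // counting the downloads of A and B plus the upload of C.
    static double flops_per_word(long n, long k, long m)
    {
        return 2.0 * n * k * m / ((double)n * k + (double)k * m + (double)n * m);
    }

    int main()
    {
        printf("outer product (4096 x 1)(1 x 4096):     %.1f\n", flops_per_word(4096, 1, 4096));
        printf("thin gemm (4096 x 64)(64 x 4096):       %.1f\n", flops_per_word(4096, 64, 4096));
        printf("square gemm (4096 x 4096)(4096 x 4096): %.1f\n", flops_per_word(4096, 4096, 4096));
        return 0;
    }

The thin case from the earlier question sits somewhere in between the two bounding cases, which is why it can still be worth offloading when m is large enough.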

What about sparse matrices? Is it worth using CUDA for sparse matrix multiplication, given that the matrix is square? I suspect the answer is no, since a sparse matrix can be re-organized into unbalanced matrices that resemble vectors, which would make the number of operations per memory access small.

There are CUDA accelerated sparse matrix-vector routines which give a good speed-up over comparable CPU implementations at sensible sizes. See here for more details.
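For a flavour of what those routines do, here is a minimal one-thread-per-row CSR matrix-vector kernel; the library implementations use considerably more sophisticated formats and memory access patterns than this sketch:

    // y = A*x for a sparse matrix A stored in CSR format, one thread per row.
    // row_ptr has num_rows + 1 entries; col_idx and val hold the nonzeros.
    __global__ void spmv_csr(const int *row_ptr, const int *col_idx,
                             const float *val, const float *x, float *y,
                             int num_rows)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            float dot = 0.0f;
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                dot += val[j] * x[col_idx[j]];
            y[row] = dot;
        }
    }

    // Launch example:
    // spmv_csr<<<(num_rows + 255) / 256, 256>>>(d_row_ptr, d_col_idx, d_val, d_x, d_y, num_rows);

The irregular, data-dependent reads of x are exactly the low operations-per-access pattern discussed above, which is why sparse kernels are bandwidth bound and the speed-ups are more modest than for dense gemm.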