performance of the matrix multiplication

Kilong · September 8, 2013, 9:32pm

Hi all,
I have some code implemented in C++ based on intel mkl, the major calculation in the code is a big matrices multiplication. One matrix (A) is of 17000x17000 which will be multiplied with another matrix (B) of 17000x200. Both A and B are complex matrix of double type. Matrix A is a band matrix with each (main and off-) diagonal is a constant number, just like

A_{n, n} = c0
A_{n, n+1} = c1
A_{n, n+2} = c2
…

I used the following code (matrix and complex number are wrapped by thrust and cusp::complex) to do the multiplication

thrust::device_vector A(17000*17000), B(17000*200);
  int m = 17000, n=200, k=17000;
  cusp::complex<double> alphac = cusp::complex<double>(1, 0);
  cusp::complex<double> betac = cusp::complex<double>(0, 0);
  status = cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,   // C(m*n) = A(m*k) * B(k*n)
                       m, n, k, 
                       &alphac,
                       thrust::raw_pointer_cast(&A[0]),
                       m,                       
                       thrust::raw_pointer_cast(&B[0]),
                       k,                           
                       &betac,                          
                       thrust::raw_pointer_cast(&B[0]), 
                       m);

I will say this code is faster than the mkl implementation. But I have to repeat the same operation for about 20000 times, so it still to quite long before it’s done. Since I don’t have any experience of CUDA programming before, I just want to ask if this is the right way to implement the matrix multiplication? Since the matrix of A is banded matrix, is that any special thing or function I can use to boost the efficiency? Thanks.

JFSebastian · September 9, 2013, 2:19pm

I point out just two possibilities that come to my mind:

cuBLAS has routines for matrix-vector multiplications for banded matrices; drawback: in your case, you should run them in a for loop to account for all the needed possibilities;
cuBLAS has batched operations for concurrent multiplications between small-sized matrices; you should segment the original matrix-matrix multiplication in several small-sized multiplications; drawback: you should again use for loops;

Be aware I have never tried those two possibilities before.

Kilong · September 10, 2013, 2:39am

Thanks for the suggestion. I am thinking the second suggestion. I am wondering if you are saying something like stream or not. I used thrust for some while and don’t know much about CUDA programming. I am reading the CUDA doc this day and I saw many introduction on matrix multiplication with CUDA kernel instead of CUBLAS library. So should CUBLAS be faster? It is not many example online to introduce stream for matrix multiplication, I am still looking for example.

njuffa · September 10, 2013, 4:11am

CUBLAS follows the reference BLAS implementation which does not define a function for multiplying a banded matrix by a general matrix. LAPACK does not seem to provide that functionality either, best I can tell. How sparse is the band matrix in question, i.e. how many sub- and superdiagonals does it have? Is it tridiagonal?

As long as you are implementing the desired operation via ZGEMM, you would definitely want to continue using CUBLAS’s version of it. The reason you can find many introductions to CUDA programming that use self-written matrix multiplies as an example is because it is a self-contained and easy to understand task that lends itself well to demonstrating some fundamental CUDA programming techniques.

But creating high-performance matrix multiplies like those provided by CUBLAS is a matter for experts (on any platform).

sBc-Random · September 10, 2013, 6:27am

If the bandwidth is low (<20), I’d recommend using cusparse for the best possible implementation (ignoring of course a custom kernel implementation). For example, a tridiagonal matrix multiply 10^7 * 10^7 10^7 * 32 takes me roughly 0.05s

In your case it’s a toeplitz matrix, which can obviously further simplify calculations

ans = c0 * (D0V) + c1 * (D1 * V) + c2 * (D2 * V) whereby D0…Dn are permutation matrices. Using a custom kernel you could use that simplification to only require reads of a (bandwidth1) c vector and the right hand side matrix, and since banded*general is memory bound in most cases, the absolute best runtime you could get is

~= 1700020082 / (280102410241024) sec ~= 0.1ms. I believe it should be fairly easy to get to 10% of this peak, which is 1ms/calculation.

JFSebastian · September 10, 2013, 11:33am

The batched version of cuBLAS use indeed streams, as long as I know. cusparse is indeed worth trying as pointed out by sBc-Random, I forgot to mention.

njuffa · September 10, 2013, 1:43pm

I second sBc-Random’s suggestion to look into using CUSPARSE, that is the reason I inquired about details with regard to the exact shape / properties of the band matrix.

Kilong · September 10, 2013, 2:18pm

Thanks all. The code was not generated by myself but I read the code carefully, I figure out that the banded matrix is not sparse. Almost all entries are non-zero, maybe it is not a good word to call it band matrix. It is nxn matirx with (n-2)x(n-2) diagonals with each diagonal a constant number.

sBc-Random · September 10, 2013, 3:52pm

That’s a very specialised kind of matmul - you are unlikely to find an optimized solution to that. Keeping in mind you don’t need to store the constant for each diag over and over again, but rather need to store a 2*(n-2) - 1 vector of constants. Once you have done this, you could code a very efficient matmul which would get some 10-20* speedup of gemm, I imagine :)

JFSebastian · September 10, 2013, 4:48pm

Your matrix is then a Toeplitz matrix, see

[url]http://en.wikipedia.org/wiki/Toeplitz_matrix[/url]

Matrix multiplication by a Toeplitz matrix and a dense matrix is equivalent to a cyclic convolution, which can be calculated by a standard FFT, see for example

[url]https://ccrma.stanford.edu/~jos/filters/Cyclic_Convolution_Matrix.html[/url]

Kilong · September 10, 2013, 9:14pm

Thanks

sBc-Random:

If the bandwidth is low (<20), I’d recommend using cusparse for the best possible implementation (ignoring of course a custom kernel implementation). For example, a tridiagonal matrix multiply 10^7 * 10^7 10^7 * 32 takes me roughly 0.05s

In your case it’s a toeplitz matrix, which can obviously further simplify calculations

ans = c0 * (D0V) + c1 * (D1 * V) + c2 * (D2 * V) whereby D0…Dn are permutation matrices. Using a custom kernel you could use that simplification to only require reads of a (bandwidth1) c vector and the right hand side matrix, and since banded*general is memory bound in most cases, the absolute best runtime you could get is

~= 1700020082 / (280102410241024) sec ~= 0.1ms. I believe it should be fairly easy to get to 10% of this peak, which is 1ms/calculation.

Hi sBC-Random, I get the idea of using permutation matrices but I think it will take me some time to understand the math behind. I just wonder the reason why this will enhance the performance since we now have+1 matrices (permutation) and need n+1 multiplication, shall it be slower. Sorry if I misunderstanding some of your information.

BTW, you also mention the toeplitz matrices and FFT, I wonder if this is different scheme than the permutation one or it basically the same thing. I didn’t use FFT before and wonder if this will much faster than the cublas basic matrix multiplication. Again, I believe it will take me some times to learn all those methods.

Kilong · September 10, 2013, 9:25pm

Kilong:

Thanks

sBc-Random:

If the bandwidth is low (<20), I’d recommend using cusparse for the best possible implementation (ignoring of course a custom kernel implementation). For example, a tridiagonal matrix multiply 10^7 * 10^7 10^7 * 32 takes me roughly 0.05s

In your case it’s a toeplitz matrix, which can obviously further simplify calculations

ans = c0 * (D0V) + c1 * (D1 * V) + c2 * (D2 * V) whereby D0…Dn are permutation matrices. Using a custom kernel you could use that simplification to only require reads of a (bandwidth1) c vector and the right hand side matrix, and since banded*general is memory bound in most cases, the absolute best runtime you could get is

~= 1700020082 / (280102410241024) sec ~= 0.1ms. I believe it should be fairly easy to get to 10% of this peak, which is 1ms/calculation.

Hi sBC-Random, I get the idea of using permutation matrices but I think it will take me some time to understand the math behind. I just wonder the reason why this will enhance the performance since we now have+1 matrices (permutation) and need n+1 multiplication, shall it be slower. Sorry if I misunderstanding some of your information.

BTW, you also mention the toeplitz matrices and FFT, I wonder if this is different scheme than the permutation one or it basically the same thing. I didn’t use FFT before and wonder if this will much faster than the cublas basic matrix multiplication. Again, I believe it will take me some times to learn all those methods.

I might not clarify my reply. What I am asking is if we have a nxn toeplitz matrix (A) which multiplies with a nxk matrix (B). As your suggestion, I can store a vector of (2xn+1) elements with each elements represent the constant diagonal element. So AxB can basically be implemented as 2n+1 vector-matrix multiplication. So will all those 2n+1 vector-matrix multiplication be faster than one matrix-matrix multiplication?

JFSebastian · September 10, 2013, 10:13pm

I’m suggesting something different from what sBc-Random is suggesting. Once you have understood the maths behind, interpreting the matrix multiplication between a Toeplitz matrix and a dense one as a convolution and implementing it by FFTs will be as easy as calling a routine of the cuFFT library, if I correctly understand your problem.

Be warned that I’m not claiming that this solution is either simpler or more effective than that suggested by sBc-Random. I’m just saying to think about :-)

Kilong · September 11, 2013, 2:56am

Thanks for that suggestion again. I am reading the math now and will try the code with cuFFT later :)

Kilong · September 11, 2013, 3:12am

sBc-Random:

If the bandwidth is low (<20), I’d recommend using cusparse for the best possible implementation (ignoring of course a custom kernel implementation). For example, a tridiagonal matrix multiply 10^7 * 10^7 10^7 * 32 takes me roughly 0.05s

In your case it’s a toeplitz matrix, which can obviously further simplify calculations

ans = c0 * (D0V) + c1 * (D1 * V) + c2 * (D2 * V) whereby D0…Dn are permutation matrices. Using a custom kernel you could use that simplification to only require reads of a (bandwidth1) c vector and the right hand side matrix, and since banded*general is memory bound in most cases, the absolute best runtime you could get is

I finally got a chance to write a testing code to implement the matrix multiplication on toeplitz matrix and a dense matrix. I write the constant diagonals into a vector and loop it for 2n+1 time with each loop perform a vector and matrix multiplication. I figure that it will be pretty fast for small matrix. But for matrix as big as mine, it is too slow.
~= 1700020082 / (280102410241024) sec ~= 0.1ms. I believe it should be fairly easy to get to 10% of this peak, which is 1ms/calculation.

sBc-Random · September 11, 2013, 3:25am

Kilong,
Your best approach would be to use JFSebastian’s suggestion. Writing your own kernel WOULD produce better results, if you had infinite time to work on it, but I suspect cuFFT will come very, very close with not too much trouble.

It’s not the easiest problem to write your own kernel for. :)

Kilong · September 11, 2013, 3:31am

I agree with that. I am still learning the idea of CUDA programming but honestly, I still don’t understand much about kernel because the CUDA programming is pretty much different from ‘traditional programming skill’ I learnt before :)