Hi all,

I have some code implemented in C++ based on Intel MKL; the major calculation in the code is a big matrix multiplication. One matrix (A) is 17000x17000 and is multiplied with another matrix (B) of 17000x200. Both A and B are complex matrices of double type. Matrix A is a band matrix in which each (main and off-) diagonal is a constant number, just like

A_{n, n} = c0

A_{n, n+1} = c1

A_{n, n+2} = c2

…

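For context, this is roughly how I fill A on the host in column-major order (so it matches what cuBLAS expects). The helper name `make_band` and the offset-indexed vector of constants `c` are just my own sketch, not library functions:

```cpp
#include <complex>
#include <vector>

// Fill an n-by-n column-major dense matrix whose entry A(i, j) depends only
// on the diagonal offset j - i (a band with constant diagonals). `c` holds
// the constants for offsets -bandwidth .. +bandwidth; entries outside the
// band are zero.
std::vector<std::complex<double> > make_band(
        int n, const std::vector<std::complex<double> >& c, int bandwidth) {
    std::vector<std::complex<double> > A(static_cast<size_t>(n) * n,
                                         std::complex<double>(0, 0));
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            int d = j - i;  // diagonal offset: 0 = main, +1 = first upper, ...
            if (d >= -bandwidth && d <= bandwidth)
                A[static_cast<size_t>(j) * n + i] = c[d + bandwidth];
        }
    return A;
}
```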
I used the following code (matrices and complex numbers are wrapped with thrust and cusp::complex) to do the multiplication:

```
cublasHandle_t handle;
cublasCreate(&handle);

// A is m x k, B is k x n, C is m x n, all column-major on the device
thrust::device_vector<cusp::complex<double> > A(17000 * 17000),
                                              B(17000 * 200),
                                              C(17000 * 200);
int m = 17000, n = 200, k = 17000;
cusp::complex<double> alphac(1, 0);
cusp::complex<double> betac(0, 0);
// C(m*n) = A(m*k) * B(k*n); the result goes into C, not back into B
cublasStatus_t status = cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        reinterpret_cast<const cuDoubleComplex*>(&alphac),
        reinterpret_cast<const cuDoubleComplex*>(thrust::raw_pointer_cast(&A[0])),
        m, // lda
        reinterpret_cast<const cuDoubleComplex*>(thrust::raw_pointer_cast(&B[0])),
        k, // ldb
        reinterpret_cast<const cuDoubleComplex*>(&betac),
        reinterpret_cast<cuDoubleComplex*>(thrust::raw_pointer_cast(&C[0])),
        m); // ldc
```
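To double-check that I have m/n/k and the leading dimensions right, here is my understanding of what this cublasZgemm call computes, written as a plain column-major GEMM on the CPU. `gemm_ref` is my own helper for illustration, not a cuBLAS function:

```cpp
#include <complex>
#include <vector>

typedef std::complex<double> zd;

// C(m x n) = alpha * A(m x k) * B(k x n) + beta * C, all column-major,
// with leading dimensions lda = m, ldb = k, ldc = m (no transposes).
void gemm_ref(int m, int n, int k, zd alpha,
              const std::vector<zd>& A, const std::vector<zd>& B,
              zd beta, std::vector<zd>& C) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            zd acc(0, 0);
            for (int p = 0; p < k; ++p)
                acc += A[p * m + i] * B[j * k + p];  // A(i,p) * B(p,j)
            C[j * m + i] = alpha * acc + beta * C[j * m + i];
        }
}
```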

I would say this code is faster than the MKL implementation, but I have to repeat the same operation about 20000 times, so it still takes quite a long time to finish. Since I don't have any prior CUDA programming experience, I just want to ask: is this the right way to implement the matrix multiplication? And since A is a band matrix, is there any special storage format or function I can use to boost the efficiency? Thanks.