How to improve performance when multiplying two large matrices in CUDA?

Hi everybody,
I have seen that multiplying two small matrices in CUDA can use shared memory to improve performance.

However, when my data is large, I can’t use shared memory.
Currently I use a Quadro K4000, whose maximum shared memory is 48 KB, but my data is 500 KB.

Any ideas to help me speed up multiplying two large matrices?

Thanks very much.


And for the simple tiled shared-memory implementations of matrix multiply, only one tile of each input is copied into shared memory at a time, so the size of shared memory doesn’t limit the problem size. The size of shared memory only limits the tile size, which is limited by the threadblock size (threads per block) anyway.
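To make the point about tile size concrete, here is a minimal sketch of a tiled kernel for cuDoubleComplex. It is not taken from any library; the names tiledZgemm and TILE are made up for illustration, and row-major storage is an assumption. With a 16 x 16 tile the kernel uses 2 * 16 * 16 * 16 bytes = 8 KB of shared memory per block, no matter how large M, N and K are.

```cuda
#include <cuComplex.h>
#include <cuda_runtime.h>

#define TILE 16  // hypothetical tile width: 16 x 16 = 256 threads per block

// C (M x N) = A (M x K) * B (K x N), all assumed row-major.
__global__ void tiledZgemm(const cuDoubleComplex* A, const cuDoubleComplex* B,
                           cuDoubleComplex* C, int M, int N, int K)
{
    // Fixed-size tiles: the shared memory footprint does not depend on M, N, K.
    __shared__ cuDoubleComplex tileA[TILE][TILE];
    __shared__ cuDoubleComplex tileB[TILE][TILE];

    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    cuDoubleComplex acc = zero;

    // Walk along the K dimension one tile at a time.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Pad with zeros when K (e.g. 990) is not a multiple of TILE.
        tileA[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : zero;
        tileB[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : zero;
        __syncthreads();  // tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc = cuCfma(tileA[threadIdx.y][k], tileB[k][threadIdx.x], acc);
        __syncthreads();  // everyone done reading before the next load
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

A launch for the sizes in this thread would use dim3 block(TILE, TILE) and dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE). This is only the simple version; it is not expected to match cuBLAS.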

Thanks for your answer!

My real data type is cuDoubleComplex:

+) matrix A with row = 160, col = 990
+) matrix B with row = 990, col = 160

I have used cublasZgemm to multiply the two matrices A and B. However, performance did not improve compared with the normal method of multiplying two matrices (here I don’t use tiled shared memory because the shared data is very large).

Do you have any other ideas?
Can anyone help me?


What is the “normal method multiply 2 matrices”?

Do you mean multiplication on the host?

With CUDA GPUs, anyway, I don’t think you’re going to find a method faster than CUBLAS to multiply matrices.
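For reference, a minimal sketch of what a cublasZgemm call for this problem might look like. The wrapper name rowMajorZgemm is made up for illustration, and it assumes the matrices are stored row-major on the device (as the indexing elsewhere in this thread suggests). cuBLAS itself is column-major, so the sketch swaps the operands: a row-major C is the same memory as a column-major C^T, and C^T = B^T * A^T.

```cuda
#include <cublas_v2.h>
#include <cuComplex.h>

// Computes C = A * B for device arrays assumed to be row-major:
// A is m x k, B is k x n, C is m x n.
cublasStatus_t rowMajorZgemm(cublasHandle_t handle,
                             const cuDoubleComplex* dA,
                             const cuDoubleComplex* dB,
                             cuDoubleComplex* dC,
                             int m, int n, int k)
{
    const cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);

    // Column-major view: dB is an n x k matrix (ld = n), dA is k x m (ld = k),
    // and the n x m product (ld = n) is exactly row-major C.
    return cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, m, k,
                       &alpha,
                       dB, n,
                       dA, k,
                       &beta,
                       dC, n);
}
```

For the matrices in this thread the call would be rowMajorZgemm(handle, dA, dB, dC, 160, 160, 990). The handle should be created once with cublasCreate and reused, since creating it is comparatively expensive; cublasSetStream can attach it to an existing stream.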

This is my function to multiply 2 matrices:

+) matrixA with row = 160, col = 990
+) matrixB with row = 990, col = 160

// Kernel on device

__global__ void cudaMatrixMultiply(cuDoubleComplex* MatrixA, cuDoubleComplex* MatrixB, cuDoubleComplex* MatrixResult, int tile_width)
{
    // One accumulator per thread, held in dynamically sized shared memory.
    extern __shared__ cuDoubleComplex tempResult[];

    int id = blockDim.x * threadIdx.y + threadIdx.x;

    tempResult[id] = make_cuDoubleComplex(0, 0);

    int col = blockIdx.x * tile_width + threadIdx.x;
    int row = blockIdx.y * tile_width + threadIdx.y;

    if (col < 160 && row < 160) {
        // MatrixA is 160 x 990 and MatrixB is 990 x 160, assuming row-major storage.
        for (int k = 0; k < 990; k++) {
            tempResult[id] = cuCfma(MatrixA[row * 990 + k], MatrixB[k * 160 + col], tempResult[id]);
        }

        MatrixResult[row * 160 + col] = tempResult[id];
    }
}


And the kernel call on the host:

void MatrixMultiply(cuDoubleComplex* MatrixA, cuDoubleComplex* MatrixB, cuDoubleComplex* MatrixResult, cudaStream_t stream1)
{
    int tile_width = 16;

    dim3 dimBlock(tile_width, tile_width);
    dim3 dimGrid(160 / dimBlock.x, 160 / dimBlock.y);

    // Dynamic shared memory: one cuDoubleComplex accumulator per thread in the block.
    cudaMatrixMultiply<<<dimGrid, dimBlock, tile_width * tile_width * sizeof(cuDoubleComplex), stream1>>>(MatrixA, MatrixB, MatrixResult, tile_width);
}


While cublas will almost certainly get you the best performance, it is worthwhile to understand the underlying algorithms in case you need to write code that works outside of the designs of the given libraries. For gemm, here are the best sources of information on that:

And as txbob noted above these all use a fixed amount of shared memory, independent of the matrix sizes. That fixed size is only dependent on the width of your blocks and the amount of loop unrolling you want.

If you’re not seeing performance improvements with cublas, then it probably means kernel initialization and memory copies are outweighing your computational gains. Keep the data on the device for additional calculations, or set up a pipeline of work so that copying and computing run concurrently. Benchmarking a single 160x990 matrix multiply isn’t going to tell you much unless you’re timing just the kernel call on the device itself (cudaEventRecord, cudaEventElapsedTime).
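A minimal sketch of timing only the device work with CUDA events, as suggested above. The kernel workKernel and the helper timeKernelMs are hypothetical stand-ins; in practice the launch inside the loop would be the gemm call or custom kernel being measured. The warm-up launch and the averaging over several iterations are choices made for this sketch, not requirements.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the work being measured.
__global__ void workKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

// Returns the average device time per launch in milliseconds,
// or a negative value if the events could not be created.
float timeKernelMs(float* dData, int n, int iterations)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization is not counted.
    workKernel<<<blocks, threads>>>(dData, n);
    cudaDeviceSynchronize();

    // Events are recorded in the same (default) stream as the kernels,
    // so the elapsed time covers only the launches between them.
    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i)
        workKernel<<<blocks, threads>>>(dData, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has happened

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iterations;
}
```

Because host-to-device and device-to-host copies sit outside the two events, this measures the kernel alone; timing the copies separately in the same way shows how much of the total they account for.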