How to improve performance when multiplying two large matrices in CUDA?

michael_alex · March 14, 2014, 4:59am

Hi everybody,
I have seen that multiplying two small matrices in CUDA can use shared memory to improve performance.

However, when the data is large, I can't use shared memory.
I am currently using a Quadro K4000, whose maximum shared memory is 48 KB, but my data is 500 KB.

Any ideas to help me speed up the multiplication of two large matrices?

Thanks very much.

Robert_Crovella · March 14, 2014, 12:28pm

Use CUBLAS.

And for the simple tiled shared memory implementations of matrix multiply, only a tile is copied into shared memory at a time, so the size of shared memory doesn’t affect the problem size. The size of shared memory only limits the tile size, which is limited by the threadblock (threads per block) anyway.
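A minimal sketch of the tiled approach described above, to show why the shared memory footprint stays fixed. The kernel name, the `TILE` constant, the row-major layout, and the use of `float` are assumptions made for illustration; they are not taken from any code in this thread.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // tile edge (assumed); 16 x 16 = 256 threads per block

// C (M x N) = A (M x K) * B (K x N), all row-major.
__global__ void tiledMatMul(const float* A, const float* B, float* C,
                            int M, int K, int N)
{
    // Two fixed-size tiles: 2 * 16 * 16 * 4 bytes = 2 KB, whatever M, K, N are.
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // per-thread accumulator lives in a register

    // Walk along the K dimension one tile at a time.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Each thread loads one element of each tile; pad with zeros at the edges.
        tileA[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

It would be launched with `dim3 block(TILE, TILE)` and `dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE)`. With a 16-byte element type the same two 16 x 16 tiles would take 8 KB, still well under a 48 KB limit.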

michael_alex · March 19, 2014, 4:51am

Thanks for your answer!

My real data type is cuDoubleComplex.

+) matrix A with row = 160, col = 990
+) matrix B with row = 990, col = 160

I have used cublasZgemm to multiply the two matrices A and B. However, performance did not improve compared with the normal method of multiplying two matrices (here I don't use tiled shared memory because the shared data is very large).

Do you have another idea?
Can anyone help me?

Thanks!

Robert_Crovella · March 19, 2014, 5:29am

What is “normal method multiply 2 matrices” ?

Do you mean multiplication on the host?

With CUDA GPUs, anyway, I don’t think you’re going to find a method faster than CUBLAS to multiply matrices.
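For reference, a hedged sketch of what a cublasZgemm call could look like for the sizes mentioned in this thread. The wrapper name `zgemmRowMajor` and the assumption that the device arrays are stored row-major are illustrative only; the thread does not say how the data is laid out.

```cuda
#include <cublas_v2.h>
#include <cuComplex.h>

// Computes C = A * B for row-major device arrays (assumed layout):
// A is M x K, B is K x N, C is M x N.
// cuBLAS is column-major, so we ask it for C^T = B^T * A^T, which occupies
// memory exactly like a row-major C.
cublasStatus_t zgemmRowMajor(cublasHandle_t handle,
                             const cuDoubleComplex* dA,
                             const cuDoubleComplex* dB,
                             cuDoubleComplex* dC,
                             int M, int K, int N)
{
    const cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);

    return cublasZgemm(handle,
                       CUBLAS_OP_N, CUBLAS_OP_N,
                       N, M, K,        // dimensions of the column-major product
                       &alpha,
                       dB, N,          // B^T seen as N x K, leading dimension N
                       dA, K,          // A^T seen as K x M, leading dimension K
                       &beta,
                       dC, N);         // C^T seen as N x M, leading dimension N
}
```

For the matrices above this would be called with M = 160, K = 990, N = 160. The handle comes from `cublasCreate` and is best created once and reused, since creating it is comparatively expensive.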

michael_alex · March 19, 2014, 8:34am

This is my function to multiply 2 matrices:

+) matrixA with row = 160, col = 990
+) matrixB with row = 990, col = 160

//Kernel on Device

__global__ void cudaMatrixMultiply(cuDoubleComplex* MatrixA, cuDoubleComplex* MatrixB, cuDoubleComplex* MatrixResult, int tile_width)
{
    // One accumulator slot per thread, in dynamically sized shared memory
    extern __shared__ cuDoubleComplex tempResult[];

    int id = blockDim.x * threadIdx.y + threadIdx.x;

    tempResult[id] = make_cuDoubleComplex(0, 0);

    int col = blockIdx.x * tile_width + threadIdx.x;
    int row = blockIdx.y * tile_width + threadIdx.y;

    if (col < 160 && row < 160) {

        // Indexing assumes row-major storage: MatrixA is 160 x 990, MatrixB is 990 x 160
        for (int k = 0; k < 990; k++) {
            tempResult[id] = cuCfma(MatrixA[row * 990 + k], MatrixB[k * 160 + col], tempResult[id]);
        }

        MatrixResult[row * 160 + col] = tempResult[id];
    }
}

And call kernel on Host:

void MatrixMultiply(cuDoubleComplex* MatrixA, cuDoubleComplex* MatrixB, cuDoubleComplex* MatrixResult, cudaStream_t stream1)
{
    int tile_width = 16;

    dim3 dimBlock(tile_width, tile_width);

    dim3 dimGrid(160 / dimBlock.x, 160 / dimBlock.y);

    cudaMatrixMultiply<<<dimGrid, dimBlock, tile_width * tile_width * sizeof(cuDoubleComplex), stream1>>>(MatrixA, MatrixB, MatrixResult, tile_width);
}

scottgray · March 19, 2014, 12:33pm

While cublas will almost certainly get you the best performance, it is worthwhile to understand the underlying algorithms in case you need to write code that works outside of the designs of the given libraries. For gemm, here are the best sources of information on that:

Programming Guide :: CUDA Toolkit Documentation

http://asg.ict.ac.cn/projects/dgemm/sc11_dgemm.pdf

And as txbob noted above these all use a fixed amount of shared memory, independent of the matrix sizes. That fixed size is only dependent on the width of your blocks and the amount of loop unrolling you want.

If you’re not seeing performance improvements with cublas then it probably means kernel initialization and memory copies are outweighing your computational gains. Keep the data on the device for additional calculations, or set up a pipeline of work so that copying and computing run concurrently. Benchmarking a single 160x990 MM isn’t going to tell you much unless you’re timing just the kernel call on the device itself (cudaEventRecord, cudaEventElapsedTime).
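A small sketch of timing only the device-side work with CUDA events, as suggested above. The `scale` kernel and the buffer size are placeholders invented for this example; in practice the timed region would contain the matrix multiply kernel or the cublasZgemm call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real work (hypothetical).
__global__ void scale(float* x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    float* d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization is not part of the measurement.
    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
    cudaDeviceSynchronize();

    // Time only the kernel: no allocation, no host/device copies in between.
    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

Comparing this number against a wall-clock measurement that also includes cudaMalloc and cudaMemcpy gives a rough idea of how much of the total is overhead and how much is computation.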
