Multiplying a system of 3x3 matrices efficiently

Vector11 · September 11, 2009, 4:53pm

Hi,

I have not been using CUDA long and am working on understanding how to program kernels in the most efficient manner. I am currently developing CFD code which requires several operations of 100’s to 1000’s of 3x3 matrices multiplied over the grid. This can be done more efficiently in parallel on the GPU so I wrote a simple kernel to calculate all 9 elements of each matrix per thread. My current version of the code is running 30% faster than the original version of the operation on the CPU. This is my best performing kernel so far. It uses 8 registers per thread but is a shared memory hog (~84 bytes per thread). The amount of shared memory used is limiting me to around 160 - 190 threads per multiprocessor. Does anyone have any suggestions on how to improve the performance of this operation?

[codebox]global void StackedMatMul(float* a, float* b, float* c)

{

int i, j, iter;

int idx = blockIdx.x*blockDim.x + threadIdx.x;

for(iter = 0; iter < 10000; iter++) {

// Allocate space for one matrix set to shared memory per thread per block

shared float Bs[3][3];

shared float As[3][3];

if(idx < np1) {

// Load a and b from global to shared memory on block

// Each thread loads one matrix

for(j = 0; j < 3; j++) {

  for(i = 0; i < 3; i++) {

    As[i][j] = a[i+3*j+9*idx];

    Bs[i][j] = b[i+3*j+9*idx];

  } // i

} // j

// Calculate one matrix per thread and write to global memory space

c[0+3*0+9*idx] = As[0][0]*Bs[0][0]+As[0][1]*Bs[1][0]+As[0][2]*Bs[2][0];

c[0+3*1+9*idx] = As[0][0]*Bs[0][1]+As[0][1]*Bs[1][1]+As[0][2]*Bs[2][1];

c[0+3*2+9*idx] = As[0][0]*Bs[0][2]+As[0][1]*Bs[1][2]+As[0][2]*Bs[2][2];

c[1+3*0+9*idx] = As[1][0]*Bs[0][0]+As[1][1]*Bs[1][0]+As[1][2]*Bs[2][0];

c[1+3*1+9*idx] = As[1][0]*Bs[0][1]+As[1][1]*Bs[1][1]+As[1][2]*Bs[2][1];

c[1+3*2+9*idx] = As[1][0]*Bs[0][2]+As[1][1]*Bs[1][2]+As[1][2]*Bs[2][2];

c[2+3*0+9*idx] = As[2][0]*Bs[0][0]+As[2][1]*Bs[1][0]+As[2][2]*Bs[2][0];

c[2+3*1+9*idx] = As[2][0]*Bs[0][1]+As[2][1]*Bs[1][1]+As[2][2]*Bs[2][1];

c[2+3*2+9*idx] = As[2][0]*Bs[0][2]+As[2][1]*Bs[1][2]+As[2][2]*Bs[2][2];

__syncthreads();

} // End for: iter

} // End if[/codebox]

The increased performance is nice, but I was hoping for 70 - 100% improvement in operation time and I know this kernel can be written more efficiently.

Jeff_Gullett · September 11, 2009, 5:15pm

Yes, the biggest thing that I see is that your global memory accesses are not coalesced. If I were you, I would make each block contain 288 threads (divisible by 9 (matrix size) and 32 (warp size)), where each thread loads a single value into As and Bs, which would be 32 matrices in shared memory (instead of your threadCount # of matrices, which is stressing your shared memory). As and Bs would then be loaded one element at a time based solely on the threadIdx (and blockIdx, but you get the idea). Perform a sync threads, then have each thread compute only a single value in the resulting matrix ‘c’ (again, to coalesce memory accesses, make sure threads write to c sequentially). This should dramatically improve your performance.

-Jeff

Vector11 · September 11, 2009, 11:33pm

Jeff,

Thanks for the advice. That definitely did the trick. My new code did significantly better this time around (x6 faster than the quad core 2Ghz Xeon). Below is a copy of the new version. I can probably still do better and will keep playing with it and welcome more advice.

The test I ran multiplied a set of 8192 3x3 matrices together and looped over this calculation 10,000 times. The CUDA code on the GPU did it in 2.156 seconds while the CPU version did it in 12.156 seconds.

… the kernel…

[codebox]// ----- Kernel performs matrix multiplication for a stack of 3x3 matrices

// ----- Each block calculates product for 32 matrices

global void StackedMatMul(float* a, float* b, float* c)

{

// Allocate space for one matrix set to shared memory per thread per block

shared float As[3][3][32]; //

shared float Bs[3][3][32];

int iter;

int ix_i = threadIdx.x;

int iy_i = threadIdx.y;

int iz_i = threadIdx.z;

int idx_i = blockIdx.xblockDim.xblockDim.yblockDim.z + (ix_i + 3iy_i + 9*iz_i);

// for(iter = 0; iter < 10000; iter++) {

if(idx_i < 9*np1) {

// Load a and b from global to shared memory on block  

// Each thread loads one element

As[ix_i][iy_i][iz_i] = a[idx_i];

Bs[ix_i][iy_i][iz_i] = b[idx_i];

__syncthreads();

} // End if

int ix_o = threadIdx.x;

int iy_o = threadIdx.y;

int iz_o = threadIdx.z;

int idx_o = blockIdx.xblockDim.xblockDim.yblockDim.z + (ix_o + 3iy_o + 9*iz_o);

if(idx_o < 9*np1) {

// Calculate one matrix per thread and write to global memory space

c[idx_o] = As[ix_o][0][iz_o]*Bs[0][iy_o][iz_o]+As[ix_o][1][iz_o]*Bs[1][

iy_o][iz_o]+As[ix_o][2][iz_o]*Bs[2][iy_o][iz_o];

} // End if

__syncthreads();

//} // End for: iter – iter loop for performance testing only

} // End StackedMatMul[/codebox]

… and the portion of the code calling the kernel

np1 = 8192;

threadsPerBlock = 288;

[codebox] dim3 blockSize(3,3,32);

int size = (np1*9)/threadsPerBlock;

dim3 numBlocks = size; // Sets grid size to dim3 type

printf(“size = %d”,size);

pause = getchar();

printf(“… GPU timer begins\n”);

c0 = clock();

for(iter = 0; iter < 10000; iter++) {

// ----- Kernel calls

StackedMatMul<<<numBlocks,blockSize>>>(a_d,b_d,c_d);

} //End for: iter

c1 = clock();

printf(“… GPU finished\n”);[/codebox]

Topic		Replies	Views
Optimizing Stack Matrix Multiplication Kernel CUDA Programming and Performance	1	1898	September 26, 2009
Checking Performance learning how to optimize CUDA codes CUDA Programming and Performance	4	2171	October 7, 2008
Ordered Multiplication of various matrices in shared memory greater minds please help CUDA Programming and Performance	10	2553	June 22, 2009
Matrix multiplication CUDA CUDA Programming and Performance	7	3040	November 12, 2012
Better way to program Cuda CUDA Programming and Performance	3	1233	July 5, 2010
Multiple small matrix multiplication program structure CUDA Programming and Performance	18	7837	April 18, 2010
How to speed-up matrix multiplication using CUBLAS? CUDA Programming and Performance	6	7617	September 1, 2010
Matrix - Vector Multiplication Can't get any faster with shared memory CUDA Programming and Performance	4	7226	September 6, 2011
Cuda matrix-multiply, memory coalescing (corner turning) Legacy PGI Compilers cuda	2	1448	June 9, 2021
Need Help Optimizing Kernel CUDA Programming and Performance	6	1149	January 7, 2014

Multiplying a system of 3x3 matrices efficiently

Related topics