Multiplying a system of 3x3 matrices efficiently


I have not been using CUDA long and am still learning how to write kernels efficiently. I am currently developing CFD code that requires multiplying hundreds to thousands of 3x3 matrices over the grid. This maps well onto the GPU, so I wrote a simple kernel in which each thread computes all 9 elements of one product matrix. My current version runs 30% faster than the original CPU implementation. This is my best-performing kernel so far: it uses 8 registers per thread but is a shared memory hog (~84 bytes per thread), and the shared memory usage is limiting me to around 160 - 190 threads per multiprocessor. Does anyone have suggestions on how to improve the performance of this operation?

[codebox]__global__ void StackedMatMul(float* a, float* b, float* c)
{


int i, j, iter;

int idx = blockIdx.x*blockDim.x + threadIdx.x;

for(iter = 0; iter < 10000; iter++) {

// Allocate space for one matrix set to shared memory per thread per block

__shared__ float Bs[3][3];

__shared__ float As[3][3];

if(idx < np1) { // np1 = number of matrices, defined elsewhere

// Load a and b from global to shared memory on block

// Each thread loads one matrix

for(j = 0; j < 3; j++) {

  for(i = 0; i < 3; i++) {

    As[i][j] = a[i+3*j+9*idx];

    Bs[i][j] = b[i+3*j+9*idx];

  } // i

} // j

// Calculate one matrix per thread and write to global memory space

c[0+3*0+9*idx] = As[0][0]*Bs[0][0]+As[0][1]*Bs[1][0]+As[0][2]*Bs[2][0];

c[0+3*1+9*idx] = As[0][0]*Bs[0][1]+As[0][1]*Bs[1][1]+As[0][2]*Bs[2][1];

c[0+3*2+9*idx] = As[0][0]*Bs[0][2]+As[0][1]*Bs[1][2]+As[0][2]*Bs[2][2];

c[1+3*0+9*idx] = As[1][0]*Bs[0][0]+As[1][1]*Bs[1][0]+As[1][2]*Bs[2][0];

c[1+3*1+9*idx] = As[1][0]*Bs[0][1]+As[1][1]*Bs[1][1]+As[1][2]*Bs[2][1];

c[1+3*2+9*idx] = As[1][0]*Bs[0][2]+As[1][1]*Bs[1][2]+As[1][2]*Bs[2][2];

c[2+3*0+9*idx] = As[2][0]*Bs[0][0]+As[2][1]*Bs[1][0]+As[2][2]*Bs[2][0];

c[2+3*1+9*idx] = As[2][0]*Bs[0][1]+As[2][1]*Bs[1][1]+As[2][2]*Bs[2][1];

c[2+3*2+9*idx] = As[2][0]*Bs[0][2]+As[2][1]*Bs[1][2]+As[2][2]*Bs[2][2];


} // End if

} // End for: iter

} // End StackedMatMul[/codebox]

The increased performance is nice, but I was hoping for a 70 - 100% improvement in operation time, and I know this kernel can be written more efficiently.

Yes, the biggest thing I see is that your global memory accesses are not coalesced. If I were you, I would make each block contain 288 threads (divisible by both 9, the matrix size, and 32, the warp size), where each thread loads a single value into As and Bs. That puts 32 matrices in shared memory per block (instead of one matrix per thread, which is what is stressing your shared memory). As and Bs are then loaded one element at a time based solely on threadIdx (and blockIdx, but you get the idea). Call __syncthreads(), then have each thread compute only a single element of the result matrix 'c' (again, to coalesce memory accesses, make sure threads write to c sequentially). This should dramatically improve your performance.



Thanks for the advice. That definitely did the trick. The new code does significantly better this time around (6x faster than the quad-core 2 GHz Xeon). Below is a copy of the new version. I can probably still do better, so I will keep playing with it and welcome more advice.

The test I ran multiplied a set of 8192 3x3 matrices together and looped over this calculation 10,000 times. The CUDA code on the GPU did it in 2.156 seconds while the CPU version did it in 12.156 seconds.

… the kernel…

[codebox]// ----- Kernel performs matrix multiplication for a stack of 3x3 matrices

// ----- Each block calculates product for 32 matrices

__global__ void StackedMatMul(float* a, float* b, float* c)
{


// Allocate space for one matrix set to shared memory per thread per block

__shared__ float As[3][3][32];

__shared__ float Bs[3][3][32];

int iter;

int ix_i = threadIdx.x;

int iy_i = threadIdx.y;

int iz_i = threadIdx.z;

int idx_i = blockIdx.x*blockDim.x*blockDim.y*blockDim.z + (ix_i + 3*iy_i + 9*iz_i);

// for(iter = 0; iter < 10000; iter++) {

if(idx_i < 9*np1) {

// Load a and b from global to shared memory on block  

// Each thread loads one element

As[ix_i][iy_i][iz_i] = a[idx_i];

Bs[ix_i][iy_i][iz_i] = b[idx_i];


} // End if

__syncthreads(); // Make sure all 32 matrices are loaded before any thread computes

int ix_o = threadIdx.x;

int iy_o = threadIdx.y;

int iz_o = threadIdx.z;

int idx_o = blockIdx.x*blockDim.x*blockDim.y*blockDim.z + (ix_o + 3*iy_o + 9*iz_o);

if(idx_o < 9*np1) {

// Calculate one matrix per thread and write to global memory space

c[idx_o] = As[ix_o][0][iz_o]*Bs[0][iy_o][iz_o]+As[ix_o][1][iz_o]*Bs[1][iy_o][iz_o]+As[ix_o][2][iz_o]*Bs[2][iy_o][iz_o];


} // End if


//} // End for: iter -- iter loop for performance testing only

} // End StackedMatMul[/codebox]

… and the portion of the code calling the kernel

np1 = 8192;

threadsPerBlock = 288;

[codebox] dim3 blockSize(3,3,32);

int size = (np1*9)/threadsPerBlock;

dim3 numBlocks = size; // Sets grid size to dim3 type

printf("size = %d",size);

pause = getchar();

printf("… GPU timer begins\n");

c0 = clock();

for(iter = 0; iter < 10000; iter++) {

// ----- Kernel call

StackedMatMul<<<numBlocks, blockSize>>>(a, b, c);
} //End for: iter

c1 = clock();

printf("… GPU finished\n");[/codebox]