 # Multiplying a system of 3x3 matrices efficiently

Hi,

I have not been using CUDA long and am working on understanding how to program kernels efficiently. I am currently developing CFD code in which several operations require multiplying hundreds to thousands of 3x3 matrices over the grid. This is well suited to the GPU, so I wrote a simple kernel in which each thread computes all 9 elements of one matrix product. My current version runs 30% faster than the original CPU implementation of the operation. It is my best-performing kernel so far: it uses 8 registers per thread but is a shared memory hog (~84 bytes per thread), which limits me to around 160-190 threads per multiprocessor. Does anyone have suggestions for improving the performance of this operation?
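For comparison, the CPU baseline is just a serial loop over independent 3x3 products. A minimal sketch in plain C (the function name and `np1` argument are illustrative; element (i,j) of matrix `idx` is assumed stored at `i + 3*j + 9*idx`, matching the kernel below):

```c
#include <stddef.h>

/* Multiply np1 pairs of 3x3 matrices: C[idx] = A[idx] * B[idx].
 * Element (i,j) of matrix idx lives at [i + 3*j + 9*idx]
 * (column-major within each matrix), matching the kernel's indexing. */
void stacked_matmul_cpu(const float *a, const float *b, float *c, size_t np1)
{
    for (size_t idx = 0; idx < np1; idx++) {
        const float *A = a + 9*idx, *B = b + 9*idx;
        float *C = c + 9*idx;
        for (int j = 0; j < 3; j++)         /* column of C */
            for (int i = 0; i < 3; i++) {   /* row of C    */
                float s = 0.0f;
                for (int k = 0; k < 3; k++)
                    s += A[i + 3*k] * B[k + 3*j];
                C[i + 3*j] = s;
            }
    }
}
```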

[codebox]#define NT 64 // threads per block; must match the launch configuration

__global__ void StackedMatMul(float* a, float* b, float* c)
{
    int i, j, iter;
    int tx  = threadIdx.x;
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    // One 3x3 matrix pair in shared memory per thread per block
    // (np1, the number of matrix pairs, is a #define/__constant__ elsewhere)
    __shared__ float As[NT][3][3];
    __shared__ float Bs[NT][3][3];

    for(iter = 0; iter < 10000; iter++) { // loop for performance testing only

        if(idx < np1) {

            // Load a and b from global to shared memory on block
            for(j = 0; j < 3; j++) {
                for(i = 0; i < 3; i++) {
                    As[tx][i][j] = a[i+3*j+9*idx];
                    Bs[tx][i][j] = b[i+3*j+9*idx];
                } // i
            } // j

            // Calculate one matrix per thread and write to global memory space
            c[0+3*0+9*idx] = As[tx][0][0]*Bs[tx][0][0] + As[tx][0][1]*Bs[tx][1][0] + As[tx][0][2]*Bs[tx][2][0];
            c[0+3*1+9*idx] = As[tx][0][0]*Bs[tx][0][1] + As[tx][0][1]*Bs[tx][1][1] + As[tx][0][2]*Bs[tx][2][1];
            c[0+3*2+9*idx] = As[tx][0][0]*Bs[tx][0][2] + As[tx][0][1]*Bs[tx][1][2] + As[tx][0][2]*Bs[tx][2][2];
            c[1+3*0+9*idx] = As[tx][1][0]*Bs[tx][0][0] + As[tx][1][1]*Bs[tx][1][0] + As[tx][1][2]*Bs[tx][2][0];
            c[1+3*1+9*idx] = As[tx][1][0]*Bs[tx][0][1] + As[tx][1][1]*Bs[tx][1][1] + As[tx][1][2]*Bs[tx][2][1];
            c[1+3*2+9*idx] = As[tx][1][0]*Bs[tx][0][2] + As[tx][1][1]*Bs[tx][1][2] + As[tx][1][2]*Bs[tx][2][2];
            c[2+3*0+9*idx] = As[tx][2][0]*Bs[tx][0][0] + As[tx][2][1]*Bs[tx][1][0] + As[tx][2][2]*Bs[tx][2][0];
            c[2+3*1+9*idx] = As[tx][2][0]*Bs[tx][0][1] + As[tx][2][1]*Bs[tx][1][1] + As[tx][2][2]*Bs[tx][2][1];
            c[2+3*2+9*idx] = As[tx][2][0]*Bs[tx][0][2] + As[tx][2][1]*Bs[tx][1][2] + As[tx][2][2]*Bs[tx][2][2];

        } // End if
    } // End for: iter
}[/codebox]

The increased performance is nice, but I was hoping for a 70-100% improvement in run time, and I know this kernel can be written more efficiently.

-Jeff

Jeff,

Thanks for the advice. That definitely did the trick. My new code does significantly better this time around (6x faster than the quad-core 2 GHz Xeon). Below is a copy of the new version. I can probably still do better, so I will keep playing with it, and I welcome more advice.

The test multiplied a set of 8192 pairs of 3x3 matrices and looped over the calculation 10,000 times. The CUDA code on the GPU did it in 2.156 seconds, while the CPU version took 12.156 seconds.

… the kernel…

[codebox]// ----- Kernel performs matrix multiplication for a stack of 3x3 matrices

// ----- Each block calculates the products for 32 matrices

__global__ void StackedMatMul(float* a, float* b, float* c)
{
    // Shared memory: 32 matrix pairs per block, one 3x3 element per thread
    __shared__ float As[3][3][32];
    __shared__ float Bs[3][3][32];

    // Thread coordinates: (x,y) pick the matrix element, z picks the matrix
    int ix = threadIdx.x;  // row
    int iy = threadIdx.y;  // column
    int iz = threadIdx.z;  // matrix within the block

    int idx = blockIdx.x*blockDim.x*blockDim.y*blockDim.z + (ix + 3*iy + 9*iz);

    // for(iter = 0; iter < 10000; iter++) { // iter loop for performance testing only

    if(idx < 9*np1) {

        // Load a and b from global to shared memory on block
        As[ix][iy][iz] = a[idx];
        Bs[ix][iy][iz] = b[idx];

    } // End if

    __syncthreads(); // every element must be loaded before any thread multiplies

    if(idx < 9*np1) {

        // Calculate one output element per thread and write to global memory
        c[idx] = As[ix][0][iz]*Bs[0][iy][iz]
               + As[ix][1][iz]*Bs[1][iy][iz]
               + As[ix][2][iz]*Bs[2][iy][iz];

    } // End if

    //} // End for: iter

} // End StackedMatMul[/codebox]
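As a sanity check on the 3x3x32 layout, the flattening `ix + 3*iy + 9*iz` can be verified on the host to assign each of the 288 element slots in a block to exactly one thread. A small sketch in plain C (the helper name is illustrative):

```c
/* Returns 1 if the (ix + 3*iy + 9*iz) flattening over a 3x3x32 block
 * covers slots 0..287 exactly once (one output element per thread). */
int block_mapping_is_bijective(void)
{
    int hits[288] = {0};
    for (int iz = 0; iz < 32; iz++)
        for (int iy = 0; iy < 3; iy++)
            for (int ix = 0; ix < 3; ix++)
                hits[ix + 3*iy + 9*iz]++;   /* same flattening as the kernel */
    for (int n = 0; n < 288; n++)
        if (hits[n] != 1)
            return 0;
    return 1;
}
```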

… and the portion of the code calling the kernel

np1 = 8192;

[codebox] dim3 blockSize(3,3,32); // 9 elements x 32 matrices = 288 threads per block

dim3 numBlocks(size); // Sets grid size to dim3 type

printf("size = %d\n", size);

pause = getchar();

printf("... GPU timer begins\n");

c0 = clock();

for(iter = 0; iter < 10000; iter++) {

    // ----- Kernel calls
    StackedMatMul<<<numBlocks,blockSize>>>(a_d,b_d,c_d);

} // End for: iter

cudaThreadSynchronize(); // launches are asynchronous; wait for the GPU before stopping the timer

c1 = clock();

printf("... GPU finished\n");[/codebox]