 # Optimizing Stack Matrix Multiplication Kernel

Hi,

I wrote this kernel to multiply a stack of 3x3 matrices together. Basically, it's used to combine left and right eigenvectors with eigenvalues. Each thread computes one element of each output matrix. I felt pretty clever combining this into one operation, but I am not getting the performance I had hoped for. I have written very similar kernels that run 2 to 3 times faster and can't figure out what is different about this one.

This kernel is not able to fully occupy each SM, but the kernels I am comparing it to run at the same occupancy, so that is not the problem. I suspect that the way I am handling shared memory is causing bank conflicts and slowing the code down with every memory access. Unfortunately, I am not too familiar with what causes a bank conflict or how to get around it. Does anyone have suggestions on how to improve the speed of this kernel?

[codebox]// ---------------------- Modified StackMatMul Kernel --------------------------
// Kernel performs matrix multiplication for a stack of 3x3 matrices
// StackMatMul kernel modified to multiply three matrices, one of which is a diagonal
// Each block calculates the product for 32 matrix sets
// Block dimensions are (3,3,32) with 288 threads/block
__global__ void StackMatMul(float *a, float *b, float *c, float *diag, int MatNum)
{
// Allocate shared memory for one set of 32 matrices per block
__shared__ float As[3][3][32];
__shared__ float Bs[3][3][32];
__shared__ float diag_s[3][32];

int ix_i = threadIdx.x;
int iy_i = threadIdx.y;
int iz_i = threadIdx.z;

int idd_i = blockIdx.x*blockDim.x*blockDim.z + ix_i + 3*iz_i;
int idx_i = blockIdx.x*blockDim.x*blockDim.y*blockDim.z + (ix_i + 3*iy_i + 9*iz_i);

// Load a, b, and diag from global to shared memory
As[ix_i][iy_i][iz_i] = a[idx_i];
Bs[ix_i][iy_i][iz_i] = b[idx_i];
diag_s[ix_i][iz_i] = diag[idd_i];
__syncthreads(); // each thread reads elements loaded by other threads below

int ix_o = threadIdx.x;
int iy_o = threadIdx.y;
int iz_o = threadIdx.z;
int idx_o = blockIdx.x*blockDim.x*blockDim.y*blockDim.z + (ix_o + 3*iy_o + 9*iz_o);

if (idx_o >= 9*MatNum) return; // prevents overshooting allocated memory; must come before the store

// Calculate one output element per thread and write to global memory
c[idx_o] = As[ix_o][0][iz_o]*diag_s[0][iz_o]*Bs[0][iy_o][iz_o]
         + As[ix_o][1][iz_o]*diag_s[1][iz_o]*Bs[1][iy_o][iz_o]
         + As[ix_o][2][iz_o]*diag_s[2][iz_o]*Bs[2][iy_o][iz_o];
} // End StackMatMul
// -----------------------------------------------------------------------------[/codebox]
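For checking results, here is a minimal host-side reference of the same product, C = A·D·B per matrix set. It is a sketch built on an assumption about the layout: element (ix, iy) of matrix iz is stored at `ix + 3*iy + 9*iz` (matching the kernel's `idx_i`), and diagonal element k of matrix m at `k + 3*m` (matching `idd_i`).

```cpp
#include <cassert>

// Hypothetical host-side reference for the stacked product C = A * D * B.
// Assumes element (ix, iy) of matrix m sits at ix + 3*iy + 9*m, as the
// kernel's idx_i implies, and diag element k of matrix m at k + 3*m.
void StackMatMulRef(const float *a, const float *b, float *c,
                    const float *diag, int MatNum)
{
    for (int m = 0; m < MatNum; ++m)        // one matrix set per iteration
        for (int j = 0; j < 3; ++j)         // iy in the kernel
            for (int i = 0; i < 3; ++i) {   // ix in the kernel
                float sum = 0.0f;
                for (int k = 0; k < 3; ++k) // the three terms in the kernel's sum
                    sum += a[i + 3*k + 9*m] * diag[k + 3*m] * b[k + 3*j + 9*m];
                c[i + 3*j + 9*m] = sum;
            }
}
```

Comparing the kernel's output buffer against this on a few random stacks should catch any indexing mistakes introduced while rearranging shared memory.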

Question: "I am suspicious that the way I am handling my shared memory is causing bank conflicts and slightly slowing the code down with every memory access."

How about interchanging the positions of x and z in the shared memory indexing, say

[codebox]__shared__ float As[32][3][3];
__shared__ float Bs[32][3][3];
__shared__ float diag_s[32][3];

As[iz_i][iy_i][ix_i] = a[idx_i];
Bs[iz_i][iy_i][ix_i] = b[idx_i];
diag_s[iz_i][ix_i] = diag[idd_i];

c[idx_o] = As[iz_o][0][ix_o]*diag_s[iz_o][0]*Bs[iz_o][iy_o][0]
         + As[iz_o][1][ix_o]*diag_s[iz_o][1]*Bs[iz_o][iy_o][1]
         + As[iz_o][2][ix_o]*diag_s[iz_o][2]*Bs[iz_o][iy_o][2];[/codebox]

The remaining code is the same.
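To see why the swap helps, here is a small host-side sketch (not part of the kernel) that models shared memory as 32 banks of 4-byte words, an assumption that holds on the GPUs this kernel targets. For one warp of a (3,3,32) block it counts the largest number of distinct words hitting the same bank when reading the k = 0 term of `As`; identical words broadcast for free, distinct words in one bank serialize.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>

// Sketch: worst-case bank conflict for one warp (tid = ix + 3*iy + 9*iz)
// reading As[ix_o][0][iz_o] in the [3][3][32] layout versus
// As[iz_o][0][ix_o] in the swapped [32][3][3] layout,
// assuming 32 banks of 4-byte words (bank = word index mod 32).
int MaxBankConflict(bool swapped)
{
    std::set<int> words[32];              // distinct word indices touched per bank
    for (int tid = 0; tid < 32; ++tid) {  // the first warp of the block
        int ix = tid % 3;
        int iz = tid / 9;                 // iy drops out of the address when k = 0
        int word = swapped ? iz*9  + ix   // As[iz][0][ix] in [32][3][3]
                           : ix*96 + iz;  // As[ix][0][iz] in [3][3][32]
        words[word % 32].insert(word);
    }
    std::size_t worst = 0;
    for (int b = 0; b < 32; ++b)
        worst = std::max(worst, words[b].size());
    return (int)worst;
}
```

In the original layout the `ix*96` term vanishes modulo 32, so the three `ix` values for each `iz` pile into one bank (a 3-way conflict on every read); in the swapped layout the word index `iz*9 + ix` is distinct and below 32 for every thread of the warp, so each access is conflict-free.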