Optimizing Stack Matrix Multiplication Kernel

Hi,

I wrote this kernel to multiply a stack of 3x3 matrices together. Basically it's used to combine left and right eigenvectors with eigenvalues, so each matrix set computes C = A * diag(d) * B. Each thread computes one element of one output matrix. I felt pretty clever combining this into one operation, but I am not quite getting the performance I had hoped for. I have written very similar kernels that run 2 to 3 times faster and can't figure out what is different about this one.

This kernel is not able to fully occupy each SM, but the kernels I am comparing it to run at the same occupancy, so that is not the problem. I am suspicious that the way I am handling my shared memory is causing bank conflicts and slightly slowing the code down with every memory access. Unfortunately, I am not too familiar with what causes a bank conflict or how to get around it. Does anyone have suggestions on how to improve the speed of this kernel?

[codebox]// ---------------------- Modified StackMatMul Kernel --------------------------
// Kernel performs matrix multiplication for a stack of 3x3 matrices
// StackMatMul kernel modified to multiply three matrices, one of which is diagonal
// Each block calculates the product for 32 matrix sets
// Block dimensions are (3,3,32) with 288 threads/block
__global__ void StackMatMul(float *a, float *b, float *c, float *diag, int MatNum)
{
    // Allocate shared memory for one set of 32 matrices per block
    __shared__ float As[3][3][32];
    __shared__ float Bs[3][3][32];
    __shared__ float diag_s[3][32];

    int ix_i = threadIdx.x;
    int iy_i = threadIdx.y;
    int iz_i = threadIdx.z;
    int idd_i = blockIdx.x*blockDim.x*blockDim.z + ix_i + 3*iz_i;
    int idx_i = blockIdx.x*blockDim.x*blockDim.y*blockDim.z + (ix_i + 3*iy_i + 9*iz_i);

    // Load a and b from global to shared memory on block
    // Each thread loads one element
    As[ix_i][iy_i][iz_i] = a[idx_i];
    Bs[ix_i][iy_i][iz_i] = b[idx_i];
    diag_s[ix_i][iz_i] = diag[idd_i];
    __syncthreads();

    int ix_o = threadIdx.x;
    int iy_o = threadIdx.y;
    int iz_o = threadIdx.z;
    int idx_o = blockIdx.x*blockDim.x*blockDim.y*blockDim.z + (ix_o + 3*iy_o + 9*iz_o);

    if (idx_o >= 9*MatNum) return; // Prevents overshooting allocated memory

    // Calculate one output element per thread and write to global memory
    c[idx_o] = As[ix_o][0][iz_o]*diag_s[0][iz_o]*Bs[0][iy_o][iz_o]
             + As[ix_o][1][iz_o]*diag_s[1][iz_o]*Bs[1][iy_o][iz_o]
             + As[ix_o][2][iz_o]*diag_s[2][iz_o]*Bs[2][iy_o][iz_o];
} // End StackMatMul
// -----------------------------------------------------------------------------[/codebox]
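For completeness, here is a sketch of the reference computation and the launch configuration I have in mind; the device pointer names are placeholders, and MatNum is assumed to be padded to a multiple of 32 so every block has full matrix sets:

[codebox]// CPU reference with the same memory layout as the kernel:
// element (ix,iy) of matrix n sits at index ix + 3*iy + 9*n,
// and each matrix set computes C_n = A_n * diag(d_n) * B_n
void StackMatMulRef(const float *a, const float *b, float *c,
                    const float *diag, int MatNum)
{
    for (int n = 0; n < MatNum; ++n)
        for (int iy = 0; iy < 3; ++iy)
            for (int ix = 0; ix < 3; ++ix) {
                float s = 0.0f;
                for (int k = 0; k < 3; ++k)
                    s += a[ix + 3*k + 9*n] * diag[k + 3*n] * b[k + 3*iy + 9*n];
                c[ix + 3*iy + 9*n] = s;
            }
}

// Launch sketch (d_a, d_b, d_c, d_diag are device allocations of
// 9*MatNum, 9*MatNum, 9*MatNum and 3*MatNum floats respectively):
//   dim3 block(3, 3, 32);              // 288 threads per block
//   dim3 grid((MatNum + 31) / 32);     // one block per 32 matrix sets
//   StackMatMul<<<grid, block>>>(d_a, d_b, d_c, d_diag, MatNum);[/codebox]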

Question: "I am suspicious that the way I am handling my shared memory is causing bank conflicts and slightly slowing the code down with every memory access."

How about interchanging the positions of x and z in the shared memory arrays, say

[codebox]__shared__ float As[32][3][3];
__shared__ float Bs[32][3][3];
__shared__ float diag_s[32][3];

As[iz_i][iy_i][ix_i] = a[idx_i];
Bs[iz_i][iy_i][ix_i] = b[idx_i];
diag_s[iz_i][ix_i] = diag[idd_i];

c[idx_o] = As[iz_o][0][ix_o]*diag_s[iz_o][0]*Bs[iz_o][iy_o][0]
         + As[iz_o][1][ix_o]*diag_s[iz_o][1]*Bs[iz_o][iy_o][1]
         + As[iz_o][2][ix_o]*diag_s[iz_o][2]*Bs[iz_o][iy_o][2];[/codebox]

The remaining code is the same.
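Some reasoning behind the suggestion (assuming a 32-bank GPU; older 16-bank parts behave the same way modulo 16): shared memory banks are interleaved 4-byte words, and the bank index is just the word offset modulo the bank count. With As[3][3][32] the offset of As[ix][iy][iz] is 96*ix + 32*iy + iz, so the bank depends only on iz and the 9 threads of a warp that share an iz collide. With As[32][3][3] the offset of As[iz][iy][ix] is 9*iz + 3*iy + ix, which is exactly the linear thread index, so consecutive threads fall into consecutive banks. A small host-side sketch that prints the bank pattern for the first warp:

[codebox]#include <stdio.h>

// Illustration only: bank hit by each thread of the first warp when it
// touches its own element, for the old [3][3][32] and new [32][3][3] layouts.
int main(void)
{
    const int NUM_BANKS = 32;                 // 16 on compute capability 1.x
    for (int t = 0; t < 32; ++t) {            // linear thread id = x + 3*y + 9*z
        int ix = t % 3, iy = (t / 3) % 3, iz = t / 9;
        int oldOffset = 96*ix + 32*iy + iz;   // As[ix][iy][iz] in As[3][3][32]
        int newOffset = 9*iz + 3*iy + ix;     // As[iz][iy][ix] in As[32][3][3]
        printf("t=%2d  old bank=%2d  new bank=%2d\n",
               t, oldOffset % NUM_BANKS, newOffset % NUM_BANKS);
    }
    return 0;
}[/codebox]

The old layout lands every thread of the warp in banks 0 through 3 (up to nine threads per bank), while the new layout spreads the warp across all 32 banks.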