Hi,

I have not been using CUDA for long and am trying to understand how to write kernels as efficiently as possible. I am currently developing a CFD code that requires several operations, each consisting of hundreds to thousands of 3x3 matrix multiplications over the grid. This can be done more efficiently in parallel on the GPU, so I wrote a simple kernel in which each thread calculates all 9 elements of one matrix product. My current version runs 30% faster than the original CPU version of the operation, and it is my best-performing kernel so far. It uses 8 registers per thread but is a shared memory hog (~84 bytes per thread). The amount of shared memory used limits me to around 160 - 190 threads per multiprocessor. Does anyone have any suggestions on how to improve the performance of this operation?

[codebox]__global__ void StackedMatMul(float* a, float* b, float* c)
{
    int i, j, iter;
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    for(iter = 0; iter < 10000; iter++) {

        // Allocate space for one matrix set to shared memory per thread per block
        __shared__ float Bs[3][3];
        __shared__ float As[3][3];

        if(idx < np1) {

            // Load a and b from global to shared memory on block
            // Each thread loads one matrix
            for(j = 0; j < 3; j++) {
                for(i = 0; i < 3; i++) {
                    As[i][j] = a[i+3*j+9*idx];
                    Bs[i][j] = b[i+3*j+9*idx];
                } // i
            } // j

            // Calculate one matrix per thread and write to global memory space
            c[0+3*0+9*idx] = As[0][0]*Bs[0][0]+As[0][1]*Bs[1][0]+As[0][2]*Bs[2][0];
            c[0+3*1+9*idx] = As[0][0]*Bs[0][1]+As[0][1]*Bs[1][1]+As[0][2]*Bs[2][1];
            c[0+3*2+9*idx] = As[0][0]*Bs[0][2]+As[0][1]*Bs[1][2]+As[0][2]*Bs[2][2];
            c[1+3*0+9*idx] = As[1][0]*Bs[0][0]+As[1][1]*Bs[1][0]+As[1][2]*Bs[2][0];
            c[1+3*1+9*idx] = As[1][0]*Bs[0][1]+As[1][1]*Bs[1][1]+As[1][2]*Bs[2][1];
            c[1+3*2+9*idx] = As[1][0]*Bs[0][2]+As[1][1]*Bs[1][2]+As[1][2]*Bs[2][2];
            c[2+3*0+9*idx] = As[2][0]*Bs[0][0]+As[2][1]*Bs[1][0]+As[2][2]*Bs[2][0];
            c[2+3*1+9*idx] = As[2][0]*Bs[0][1]+As[2][1]*Bs[1][1]+As[2][2]*Bs[2][1];
            c[2+3*2+9*idx] = As[2][0]*Bs[0][2]+As[2][1]*Bs[1][2]+As[2][2]*Bs[2][2];

            __syncthreads();

        } // End if
    } // End for: iter
}[/codebox]
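For checking the kernel's output, here is a minimal host-side reference in plain C++ (it also compiles as host code in a .cu file). It assumes the same layout as the kernel above, with element (i, j) of matrix n stored at index i + 3*j + 9*n. The name `stacked_matmul_ref` and the matrix-count parameter `n` are placeholders made up for illustration; this is only a sketch, not the actual CPU routine from my CFD code.

```cpp
#include <cstddef>

// Host-side reference: for each of the n stacked matrices, compute C = A * B.
// Assumed layout (same as the kernel): element (i, j) of matrix m lives at
// index i + 3*j + 9*m. The function name and the parameter n are hypothetical.
void stacked_matmul_ref(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t m = 0; m < n; ++m) {
        const float* A = a + 9 * m;  // start of the m-th A matrix
        const float* B = b + 9 * m;  // start of the m-th B matrix
        float* C = c + 9 * m;        // start of the m-th result matrix

        for (int j = 0; j < 3; ++j) {
            for (int i = 0; i < 3; ++i) {
                // C(i, j) = sum over k of A(i, k) * B(k, j)
                float sum = 0.0f;
                for (int k = 0; k < 3; ++k) {
                    sum += A[i + 3 * k] * B[k + 3 * j];
                }
                C[i + 3 * j] = sum;
            }
        }
    }
}
```

Comparing the GPU result against this element by element (with a small tolerance for floating-point differences) should show whether every thread is really working on its own matrices.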

The increased performance is nice, but I was hoping for a 70 - 100% improvement in operation time, and I know this kernel can be written more efficiently.