Hi everyone,

I am new to CUDA and I am trying to multiply complex matrices according to the equation C = A'xBxA, where A is an 8x1 vector and B is an 8x8 matrix, so each result C is a single complex value.

In the code below, the input array `a` contains 64 matrices of size 8x8, while `b` contains the 64 vectors of dimension 8x1. The idea is to process all 64 matrices in a single kernel launch.

The platform I am using is a Jetson Nano with the Tegra X1. Using the profiler, I can see that the kernel takes around 1 ms to complete, but the theoretical occupancy is only 3.125%.

I would like to improve the occupancy and reduce the processing time. I suspect that the partial additions of the intermediate products (the atomicAdd reductions) are the bottleneck.
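One idea I had for the additions is to replace the shared-memory atomicAdds with warp shuffles, since each row sum only spans the 8 consecutive lanes of one row of the block (an untested sketch, not something I have tried yet):

```
// Sketch: sum a cuComplex across the 8 threads of one row using
// width-8 warp shuffles instead of shared-memory atomicAdd.
__device__ cuComplex row_sum8(cuComplex v)
{
    for (int offset = 4; offset > 0; offset >>= 1) {
        v.x += __shfl_down_sync(0xffffffff, v.x, offset, 8);
        v.y += __shfl_down_sync(0xffffffff, v.y, offset, 8);
    }
    return v;  // the thread with threadIdx.x == 0 holds the row sum
}
```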

How can I improve the processing time of this kernel? The code is below.

Thank you.

kernel launch:

```
matrix_mul <<<dim3(1,64,1), dim3(8,8,1)>>>(a, b, c);
```

CUDA kernel:

```
#include <cuComplex.h>

__global__ void matrix_mul(cuComplex *a, cuComplex *b, cuComplex *c)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // NOTE: these arrays are sized for all 64 matrices, although each
    // block only touches its own slice; they dominate shared-memory use.
    __shared__ cuComplex tmp_mul[4096];
    __shared__ cuComplex tmp_matr[512];
    __shared__ cuComplex tmp_c[512];
    int z = x + y * 8;                        // index into tmp_mul
    int k = y % 8 + x * 8 + blockIdx.y * 64;  // element of matrix a
    int l = x + blockIdx.y * 8;               // element of vector b
    // Shared memory is uninitialized: zero the accumulator before adding.
    if (threadIdx.x == 0)
        tmp_matr[y] = make_cuComplex(0.0f, 0.0f);
    __syncthreads();
    tmp_mul[z] = cuCmulf(cuConjf(b[l]), a[k]);  // conj(b_x) * a element
    // Reduce the 8 products of each row into tmp_matr[y].
    atomicAdd(&(tmp_matr[y].x), tmp_mul[z].x);
    atomicAdd(&(tmp_matr[y].y), tmp_mul[z].y);
    __syncthreads();  // must be outside the divergent branch below
    if (threadIdx.x == 7)
    {
        // Multiply the row sum by b and accumulate the final scalar;
        // c must be zero-initialized by the caller before the launch.
        tmp_c[y] = cuCmulf(tmp_matr[y], b[y]);
        atomicAdd(&(c[y / 8].x), tmp_c[y].x);
        atomicAdd(&(c[y / 8].y), tmp_c[y].y);
    }
}
```