Optimizing matrix multiplications

Hi everyone,

I am new to CUDA and I am trying to multiply complex matrices according to the equation C = A'xBxA, where A is 8x1 and B is 8x8, so the result C is a single scalar value.
The input array a contains 64 8x8 matrices and the array b contains 64 8x1 vectors (so, with respect to the equation above, a holds the B matrices and b holds the A vectors). The idea is to perform the processing for all 64 matrices in a single kernel launch.
The platform I am using is a Jetson Nano (Tegra X1). Looking at the profiler, I can see that this kernel takes around 1 ms to complete, but the theoretical occupancy is only 3.125%.
I would like to improve the occupancy and reduce the processing time; I suspect the partial additions of the intermediate products are the bottleneck.

How can I improve the processing time of this kernel? You can find the code below.

Thank you.

kernel launch:

matrix_mul <<<dim3(1,64,1), dim3(8,8,1)>>>(a, b, c);

CUDA kernel:

__global__ void matrix_mul(cuComplex *a, cuComplex *b, cuComplex *c)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    __shared__ cuComplex tmp_mul[4096];
    __shared__ cuComplex tmp_matr[512];

    int z = x + y * 8;                        // element index into tmp_mul
    int k = y % 8 + x * 8 + blockIdx.y * 64;  // element of this block's 8x8 matrix in a
    int l = x + blockIdx.y * 8;               // element of this block's 8x1 vector in b

    // Element-wise product conj(b) * a for this block's matrix.
    tmp_mul[z] = cuCmulf(cuConjf(b[l]), a[k]);

    // Shared memory is not zero-initialized, so the accumulators must be
    // cleared before the atomic adds.
    if (threadIdx.x == 0)
        tmp_matr[y] = make_cuComplex(0.0f, 0.0f);
    __syncthreads();

    // Reduce over x into one partial sum per row.
    atomicAdd(&(tmp_matr[y].x), tmp_mul[z].x);
    atomicAdd(&(tmp_matr[y].y), tmp_mul[z].y);

    // All threads must reach this barrier: a __syncthreads() inside the
    // divergent branch below would be undefined behavior.
    __syncthreads();

    if (threadIdx.x == 7)
    {
        // One thread per row multiplies by b and accumulates into c
        // (c must be zeroed on the host before the kernel launch).
        cuComplex tmp_c = cuCmulf(tmp_matr[y], b[y]);
        atomicAdd(&(c[y / 8].x), tmp_c.x);
        atomicAdd(&(c[y / 8].y), tmp_c.y);
    }
}
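Edit: since shared memory is private to each block, I think the temporaries only need one block's 64-element tile rather than the full 4096/512 entries, which should also cut the shared-memory usage that seems to limit occupancy. This is a sketch of what I mean (untested, assuming the same data layout and the same launch configuration as above):

```cuda
// Sketch, untested: index shared temporaries with threadIdx instead of the
// global y, so each block only allocates its own 8x8 tile.
__global__ void matrix_mul_v2(const cuComplex *a, const cuComplex *b, cuComplex *c)
{
    int tx = threadIdx.x;   // column index, 0..7
    int ty = threadIdx.y;   // row index, 0..7
    int m  = blockIdx.y;    // matrix index, 0..63

    __shared__ cuComplex tmp_mul[64];   // one 8x8 tile per block
    __shared__ cuComplex tmp_row[8];    // one partial sum per row

    // Same element-wise product as before, conj(b) * a.
    tmp_mul[ty * 8 + tx] = cuCmulf(cuConjf(b[tx + m * 8]),
                                   a[ty + tx * 8 + m * 64]);
    __syncthreads();

    // Reduce over tx with a plain loop (one thread per row) instead of atomics.
    if (tx == 0) {
        cuComplex s = make_cuComplex(0.0f, 0.0f);
        for (int j = 0; j < 8; ++j)
            s = cuCaddf(s, tmp_mul[ty * 8 + j]);
        tmp_row[ty] = cuCmulf(s, b[ty + m * 8]);
    }
    __syncthreads();

    // Final reduction over the rows; each block writes its own c[m],
    // so no atomics and no host-side zeroing of c are needed.
    if (tx == 0 && ty == 0) {
        cuComplex s = make_cuComplex(0.0f, 0.0f);
        for (int j = 0; j < 8; ++j)
            s = cuCaddf(s, tmp_row[j]);
        c[m] = s;
    }
}
```

The launch would stay matrix_mul_v2 <<<dim3(1,64,1), dim3(8,8,1)>>>(a, b, c);. Would this be the right direction, or is there a better reduction pattern for such small matrices?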