# Optimizing matrix multiplications

Hi everyone,

I am new to CUDA and I am trying to multiply complex matrices according to the equation C = A’xBxA (A’ denoting the conjugate transpose). A is 8x1 while B is 8x8, so the result C is a single value.
The input array `a` contains 64 matrices of size 8x8, while `b` contains 64 vectors of dimension 8x1. The idea is to perform the processing for all 64 matrices in a single kernel.
The platform I am using is the Jetson Nano, equipped with the Tegra X1. Looking at the profiler, I can see that the kernel takes around 1 ms to complete, but the theoretical occupancy is only 3.125%.
I would like to improve the occupancy and reduce the processing time. I think the partial additions of the intermediate products are the bottleneck of the system.
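For anyone who wants to check results, here is a minimal CPU reference of the same batched computation. The function name and the layout (row-major 8x8 matrices, one block of 64 entries per matrix in `a`, 8 entries per vector in `b`) are my own assumptions, chosen to match the indexing in the kernel further down:

``````cpp
#include <complex>
#include <vector>

// Computes c[i] = A_i' * B_i * A_i for 64 batches, where each B_i is an
// 8x8 row-major complex matrix in `mats` and each A_i is a length-8
// complex vector in `vecs`. Layout assumptions are mine, not the poster's.
std::vector<std::complex<float>> batched_quadratic_form(
        const std::vector<std::complex<float>>& mats,   // 64 * 64 entries
        const std::vector<std::complex<float>>& vecs)   // 64 * 8 entries
{
    std::vector<std::complex<float>> out(64);
    for (int i = 0; i < 64; ++i) {
        std::complex<float> acc(0.0f, 0.0f);
        for (int r = 0; r < 8; ++r)
            for (int c = 0; c < 8; ++c)
                acc += std::conj(vecs[i * 8 + r])
                     * mats[i * 64 + r * 8 + c]
                     * vecs[i * 8 + c];
        out[i] = acc;
    }
    return out;
}
``````

With identity matrices and all-ones vectors this should produce 8 for every batch, which makes it easy to validate a GPU version against it.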

How can I improve the processing time of this kernel? You can find the code below.

Thank you.

kernel launch:

``````matrix_mul <<<dim3(1,64,1), dim3(8,8,1)>>>(a, b, c);
``````
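As a side note, the per-thread index arithmetic for this `<<<dim3(1,64,1), dim3(8,8,1)>>>` configuration (one block per 8x8 matrix) can be sanity-checked on the host. This hypothetical snippet mirrors the intended indexing, with the shared-memory slot `z` computed per block so it stays inside a 64-entry array:

``````cpp
#include <algorithm>

struct IndexRanges { int max_z, max_k, max_l; };

// Replays every (blockIdx.y, threadIdx.y, threadIdx.x) combination of the
// launch and records the largest index produced for each array access.
IndexRanges scan_indices() {
    IndexRanges r{0, 0, 0};
    for (int by = 0; by < 64; ++by)           // blockIdx.y
        for (int ty = 0; ty < 8; ++ty)        // threadIdx.y
            for (int tx = 0; tx < 8; ++tx) {  // threadIdx.x
                int z = tx + ty * 8;              // shared-memory slot, per block
                int k = ty + tx * 8 + by * 64;    // element of the by-th 8x8 matrix in a
                int l = tx + by * 8;              // element of the by-th vector in b
                r.max_z = std::max(r.max_z, z);
                r.max_k = std::max(r.max_k, k);
                r.max_l = std::max(r.max_l, l);
            }
    return r;
}
``````

The maxima should come out as 63 for `z`, 4095 for `k` (64 matrices of 64 entries), and 511 for `l` (64 vectors of 8 entries); anything larger would mean an out-of-bounds access.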

CUDA kernel:

``````__global__ void matrix_mul(cuComplex *a, cuComplex *b, cuComplex *c)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;  // 0..7 (grid x dimension is 1)
int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index across all 64 blocks

__shared__ cuComplex tmp_mul[64];  // one partial product per thread in the block
__shared__ cuComplex tmp_matr;
__shared__ cuComplex tmp_c;

int z = threadIdx.x + threadIdx.y * 8;                  // slot in the 64-entry shared array
int k = threadIdx.y + threadIdx.x * 8 + blockIdx.y * 64;  // element (x, y) of this block's matrix in a
int l = threadIdx.x + blockIdx.y * 8;                   // element x of this block's vector in b

tmp_mul[z] = cuCmulf(cuConjf(b[l]), a[k]);  // conj(b)[x] * B[x][y], one partial product per thread