# Dot Product of Matrices

Hi all,

I would like to calculate the dot product of two matrices, and I have no idea how to do this efficiently.

The C code looks like this:

    int width = 128;

    for (int i = 0; i < width; i++)
    {
        for (int j = 0; j < width; j++)
        {
            for (int n = 0; n < 400; n++)
            {
                C[i*width + j] += A[(i*width + j) + n] * B[(j*width + i) + n];
            }
        }
    }

In CUDA, my problem is that I am not able to move the data from global memory into shared memory, which results in very low performance:

```
// host side: one block of 128 threads per row of the grid
dim3 dimBlock(128, 1);
dim3 dimGrid(1, 128);

dot<<<dimGrid, dimBlock>>>(A, B, C);

// device side; AnzR and NMAX are constants defined elsewhere (not shown here)
__global__ void dot(double* A, double* B, double* C)
{
    // index
    const unsigned int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    const unsigned int tidy = blockDim.y * blockIdx.y + threadIdx.y;
    const unsigned int tid  = tidy * AnzR + tidx;
    const unsigned int row  = tidx * NMAX;
    const unsigned int col  = tidy * NMAX;

    double sum = 0;

    // every thread reads its 400 values of A and B straight from global memory
    #pragma unroll 400
    for (int i = 0; i < 400; i++)
    {
        sum = __fma_rn(A[row + i], B[col + i], sum);
    }

    C[tid] = sum;
}
```
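To make the expected result easy to check, here is a minimal host-side sketch in plain C that mirrors the kernel's index arithmetic. It is only an illustration under assumptions: A and B are taken to be row-major arrays with `rows` rows of `len` doubles each (presumably 128 and 400 here, matching the launch configuration and the loop bound), and `dot_reference` with its parameters is a hypothetical name that does not appear in my actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference, not part of the original code.
 * Assumes A and B are row-major with `rows` rows of `len` doubles each,
 * and C has room for rows * rows doubles. */
void dot_reference(const double *A, const double *B, double *C,
                   size_t rows, size_t len)
{
    assert(A != NULL && B != NULL && C != NULL);

    for (size_t y = 0; y < rows; y++) {         /* plays the role of tidy */
        for (size_t x = 0; x < rows; x++) {     /* plays the role of tidx */
            double sum = 0.0;
            for (size_t n = 0; n < len; n++) {
                /* same indexing as A[row + i] * B[col + i] in the kernel */
                sum += A[x * len + n] * B[y * len + n];
            }
            C[y * rows + x] = sum;              /* same slot as C[tid] */
        }
    }
}
```

Calling `dot_reference(A, B, C_host, 128, 400)` and comparing `C_host` with the result copied back from the GPU should show whether the kernel is correct. The comparison probably needs a small tolerance, because the kernel uses a fused multiply-add and the last bits may differ.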

Is there a paper or documentation about such problems? Any suggestions are welcome :)

My graphics card is a GT425M.