Hi all,

I would like to calculate the dot product of two matrices, and I have no idea how to do this efficiently.

The C code looks like this:

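    int width = 128;   // C is width x width
    int nmax  = 400;   // length of each row of A and B

    for (int i = 0; i < width; i++)
    {
        for (int j = 0; j < width; j++)
        {
            for (int n = 0; n < nmax; n++)
            {
                // dot product of row i of A with row j of B
                C[i*width + j] += A[i*nmax + n] * B[j*nmax + n];
            }
        }
    }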

In CUDA, my problem is that I have not managed to move the data from global memory into shared memory, which results in very low performance:

```
// launch configuration: 128 blocks of 128 threads, one thread per element of C
dim3 dimBlock(128, 1);
dim3 dimGrid(1, 128);
dot<<<dimGrid, dimBlock>>>(A, B, C);

__global__ void dot(double* A, double* B, double* C)
{
    // index
    const unsigned int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    const unsigned int tidy = blockDim.y * blockIdx.y + threadIdx.y;
    const unsigned int tid  = tidy * AnzR + tidx;
    const unsigned int row  = tidx * NMAX;   // start of row tidx in A
    const unsigned int col  = tidy * NMAX;   // start of row tidy in B
    double sum = 0;

    #pragma unroll 400
    for (int i = 0; i < 400; i++)
    {
        sum = __fma_rn(A[row + i], B[col + i], sum);
    }

    __syncthreads();

    C[tid] = sum;
}
```
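To make the data reuse concrete, here is a minimal CPU-side sketch in C of the tiling that a shared-memory version would presumably need. It assumes A and B are stored row-major with `nmax` values per row (as the kernel's `row + i` indexing suggests). The names `dot_rows_ref`, `dot_rows_tiled`, `TILE` and `CHUNK` are made up for illustration, and the small buffers `ta` and `tb` only stand in for what would be `__shared__` arrays on the GPU:

```c
#include <stddef.h>

#define TILE  16   /* hypothetical tile edge: one tile covers TILE x TILE outputs */
#define CHUNK 16   /* hypothetical number of row elements staged at once */

/* Reference: out[i*width + j] = dot(row i of a, row j of b), rows of length nmax. */
void dot_rows_ref(const double *a, const double *b, double *out,
                  int width, int nmax)
{
    for (int i = 0; i < width; i++)
        for (int j = 0; j < width; j++) {
            double sum = 0.0;
            for (int n = 0; n < nmax; n++)
                sum += a[(size_t)i * nmax + n] * b[(size_t)j * nmax + n];
            out[(size_t)i * width + j] = sum;
        }
}

/* Tiled variant: same result, but each TILE x CHUNK slab of a and b is copied
   once into a small buffer and then reused for all TILE*TILE outputs of the
   tile. The buffers play the role that __shared__ arrays would play on the GPU. */
void dot_rows_tiled(const double *a, const double *b, double *out,
                    int width, int nmax)
{
    double ta[TILE][CHUNK], tb[TILE][CHUNK];

    for (int i0 = 0; i0 < width; i0 += TILE)
        for (int j0 = 0; j0 < width; j0 += TILE) {
            double acc[TILE][TILE] = {{0.0}};

            for (int n0 = 0; n0 < nmax; n0 += CHUNK) {
                /* stage one slab of each matrix, zero-padding past the edges */
                for (int r = 0; r < TILE; r++)
                    for (int c = 0; c < CHUNK; c++) {
                        int n = n0 + c;
                        ta[r][c] = (i0 + r < width && n < nmax)
                                 ? a[(size_t)(i0 + r) * nmax + n] : 0.0;
                        tb[r][c] = (j0 + r < width && n < nmax)
                                 ? b[(size_t)(j0 + r) * nmax + n] : 0.0;
                    }
                /* reuse the staged slabs for every output in the tile */
                for (int r = 0; r < TILE; r++)
                    for (int s = 0; s < TILE; s++)
                        for (int c = 0; c < CHUNK; c++)
                            acc[r][s] += ta[r][c] * tb[s][c];
            }

            /* write back only the outputs that lie inside the matrix */
            for (int r = 0; r < TILE && i0 + r < width; r++)
                for (int s = 0; s < TILE && j0 + s < width; s++)
                    out[(size_t)(i0 + r) * width + (j0 + s)] = acc[r][s];
        }
}
```

Each staged value is read `TILE` times from the small buffer instead of from main memory. In a kernel, the staging loop would be split across the threads of a block, with a `__syncthreads()` after loading and another after using each slab.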

Is there a paper or documentation about such problems? Any suggestions are welcome :)

My graphics card is a GT 425M.

Many thanks in advance!