Dot Product of Matrices


I would like to calculate the dot product of two matrices, and I have no idea how to do this efficiently.

The C code looks like this:

int width = 128;

for (int i = 0; i < width; i++)
    for (int j = 0; j < width; j++)
        for (int n = 0; n < 400; n++)
            C[i*width + j] += A[(i*width + j) + n] * B[(j*width + i) + n];

In CUDA, my problem is that I am not able to move the data from global memory into shared memory, which results in very low performance:

dim3 dimBlock(128, 1);
dim3 dimGrid(1, 128);

dot<<<dimGrid, dimBlock>>>(A, B, C);

__global__ void dot(double* A, double* B, double* C)
{
    const unsigned int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    const unsigned int tidy = blockDim.y * blockIdx.y + threadIdx.y;
    const unsigned int tid  = tidy * AnzR + tidx;  // index of the output element
    const unsigned int row  = tidx * NMAX;         // offset of this thread's row in A
    const unsigned int col  = tidy * NMAX;         // offset of this thread's row in B

    double sum = 0;

    #pragma unroll 400
    for (int i = 0; i < 400; i++)
        sum = __fma_rn(A[row + i], B[col + i], sum);

    C[tid] = sum;
}

Is there a paper or documentation about such problems? Any suggestions are welcome :)

My graphics card is a GT 425M.

Many thanks in advance!

You are using surprisingly few threads per block. Try 1024, or something around 736 if you limit the register count (maxreg) to 21.

For more details on how to do matrix operations, take a look at the examples in the SDK. The whitepaper is quite comprehensive.
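
As a rough illustration of the shared-memory approach those matrix examples take, here is a minimal sketch of a tiled version of the kernel above. It is not the SDK code. It assumes AnzR = 128 and NMAX = 400 (passed in as n and len), and the names dotTiled, checkDotTiled and TILE are made up for this example. Each 16x16 block stages one tile of A and one tile of B in shared memory, so global memory is read along contiguous rows and every loaded value is reused 16 times. The output layout C[j*n + i] follows the kernel in the question.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // hypothetical tile size; 16x16 = 256 threads per block

// C[j*n + i] = sum over k of A[i*len + k] * B[j*len + k]
__global__ void dotTiled(const double* A, const double* B, double* C, int n, int len)
{
    // +1 padding so that column-wise reads of As fall into different banks
    __shared__ double As[TILE][TILE + 1];
    __shared__ double Bs[TILE][TILE + 1];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int i = blockIdx.x * TILE + tx;     // row of A used for this thread's result
    const int j = blockIdx.y * TILE + ty;     // row of B used for this thread's result
    const int aRow = blockIdx.x * TILE + ty;  // row of A this thread loads from
    const int bRow = blockIdx.y * TILE + ty;  // row of B this thread loads from

    double sum = 0.0;
    for (int k0 = 0; k0 < len; k0 += TILE) {
        const int k = k0 + tx;  // neighbouring threads read neighbouring addresses
        As[ty][tx] = (aRow < n && k < len) ? A[aRow * len + k] : 0.0;
        Bs[ty][tx] = (bRow < n && k < len) ? B[bRow * len + k] : 0.0;
        __syncthreads();        // both tiles are fully loaded

        #pragma unroll
        for (int m = 0; m < TILE; ++m)
            sum = fma(As[tx][m], Bs[ty][m], sum);
        __syncthreads();        // finish reading before the tiles are overwritten
    }
    if (i < n && j < n)
        C[j * n + i] = sum;
}

// Host helper: runs the kernel on test data and returns the largest absolute
// difference to a plain CPU loop, or -1.0 if a CUDA call failed.
double checkDotTiled(int n, int len)
{
    const size_t inBytes  = (size_t)n * len * sizeof(double);
    const size_t outBytes = (size_t)n * n * sizeof(double);
    std::vector<double> hA((size_t)n * len), hB((size_t)n * len), hC((size_t)n * n);
    for (size_t t = 0; t < hA.size(); ++t) {
        hA[t] = (double)(t % 7) * 0.5;
        hB[t] = (double)(t % 5) - 2.0;
    }

    double *dA = NULL, *dB = NULL, *dC = NULL;
    cudaMalloc((void**)&dA, inBytes);
    cudaMalloc((void**)&dB, inBytes);
    cudaMalloc((void**)&dC, outBytes);
    cudaMemcpy(dA, hA.data(), inBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), inBytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    dotTiled<<<grid, block>>>(dA, dB, dC, n, len);

    // The blocking copy also reports errors from the kernel launch.
    cudaError_t err = cudaMemcpy(hC.data(), dC, outBytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return -1.0;

    double maxErr = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double ref = 0.0;
            for (int k = 0; k < len; ++k)
                ref += hA[(size_t)i * len + k] * hB[(size_t)j * len + k];
            maxErr = fmax(maxErr, fabs(ref - hC[(size_t)j * n + i]));
        }
    return maxErr;
}
```

Calling checkDotTiled(128, 400) from main should return a value close to zero if the kernel matches the CPU loop. The tile size of 16 is only a starting point; the best value depends on the card and is worth measuring.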