I am new to CUDA and am trying to use shared memory in a matrix multiplication, but the result always comes out as 0 (zero). When I use global memory, the result is correct. My graphics card is an NVS3100.

SIZE and Width are both 16, and TILE_DIM is 4.

With SHARED MEMORY - Problem

```
__global__ void MatrixMul(int *Md, int *Nd, int *Pd, int Width) {
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    __shared__ int Ad;
    __shared__ int Bd;
    int Row = by*TILE_DIM + ty;
    int Col = bx*TILE_DIM + tx;
    int PValue = 0;
    for (int k = 0; k < Width / 2; k++) {
        Ad[ty][tx] = Md[Row*Width + (k*Width + tx)];
        Bd[ty][tx] = Nd[(k*Width + ty)*Width + Col];
        __syncthreads();
    }
    for (int k = 0; k < Width; ++k) {
        PValue += Ad[tx][k] * Bd[k][ty];
        __syncthreads();
    }
    Pd[Row*Width+Col] = PValue;
}
```

With GLOBAL MEMORY - OK

```
__global__ void MatrixMul(int *Md, int *Nd, int *Pd, int Width) {
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int Row = by*TILE_DIM + ty;
    int Col = bx*TILE_DIM + tx;
    int PValue = 0;
    for (int k = 0; k < Width; ++k) {
        PValue += Md[Row*Width + k] * Nd[k*Width + Col];
    }
    __syncthreads();
    Pd[Row*Width+Col] = PValue;
}
```
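For comparison with the two kernels above, here is a minimal sketch of the tiled pattern that is commonly used with shared memory. It is an illustration under assumptions, not a verified fix for the code in this question: it assumes the shared arrays are `TILE_DIM x TILE_DIM`, that `Width` is a multiple of `TILE_DIM`, and that each block is launched with `TILE_DIM x TILE_DIM` threads. The names `MatrixMulTiled` and `multiplyOnDevice` are hypothetical and do not appear in the original code.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

#define TILE_DIM 4  // value stated in the question

// Sketch of a tiled kernel. Assumes Width is a multiple of TILE_DIM and
// that each block has TILE_DIM x TILE_DIM threads.
__global__ void MatrixMulTiled(const int *Md, const int *Nd, int *Pd, int Width) {
    __shared__ int Ad[TILE_DIM][TILE_DIM];  // assumed tile shape
    __shared__ int Bd[TILE_DIM][TILE_DIM];  // assumed tile shape

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_DIM + ty;
    int Col = blockIdx.x * TILE_DIM + tx;
    int PValue = 0;

    // Walk over the Width / TILE_DIM tiles along the shared dimension.
    for (int m = 0; m < Width / TILE_DIM; ++m) {
        // Each thread loads one element of each tile.
        Ad[ty][tx] = Md[Row * Width + (m * TILE_DIM + tx)];
        Bd[ty][tx] = Nd[(m * TILE_DIM + ty) * Width + Col];
        __syncthreads();  // tile must be fully loaded before it is read

        // Accumulate the partial dot product inside the tile loop.
        for (int k = 0; k < TILE_DIM; ++k) {
            PValue += Ad[ty][k] * Bd[k][tx];
        }
        __syncthreads();  // all reads finish before the next tile is loaded
    }
    Pd[Row * Width + Col] = PValue;
}

// Hypothetical host helper: multiplies two Width x Width row-major matrices
// on the device. P must already hold Width * Width elements.
// Returns false if any checked CUDA call fails.
bool multiplyOnDevice(const std::vector<int> &M, const std::vector<int> &N,
                      std::vector<int> &P, int Width) {
    assert(Width % TILE_DIM == 0);
    size_t bytes = size_t(Width) * Width * sizeof(int);
    int *Md = nullptr, *Nd = nullptr, *Pd = nullptr;

    bool ok = cudaMalloc(&Md, bytes) == cudaSuccess &&
              cudaMalloc(&Nd, bytes) == cudaSuccess &&
              cudaMalloc(&Pd, bytes) == cudaSuccess;
    if (ok) {
        ok = cudaMemcpy(Md, M.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
             cudaMemcpy(Nd, N.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        dim3 block(TILE_DIM, TILE_DIM);
        dim3 grid(Width / TILE_DIM, Width / TILE_DIM);
        MatrixMulTiled<<<grid, block>>>(Md, Nd, Pd, Width);
        // Checking for launch errors helps rule out a kernel that never ran.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(P.data(), Pd, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(Md);  // cudaFree accepts a null pointer
    cudaFree(Nd);
    cudaFree(Pd);
    return ok;
}
```

With Width = 16 and TILE_DIM = 4 this launches a 4 x 4 grid of 4 x 4 blocks. Two differences from the shared-memory kernel above are worth comparing: the tile offset advances by `TILE_DIM` rather than `Width`, and the multiply-accumulate step sits inside the tile loop and runs over `TILE_DIM` elements, indexed as `Ad[ty][k] * Bd[k][tx]`.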

Please help!