Problem with shared memory

I am new to CUDA and am trying to use shared memory in a matrix multiplication, but the result always comes out 0 (zero). When using global memory the result is correct. My graphics card is an NVS 3100.

The values are SIZE = Width = 16 and TILE_DIM = 4.
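(The launch configuration is not shown in the post; for those values it would presumably be something like the following, with grid/block names as an assumption:)

```cuda
// Assumed launch configuration for Width = 16, TILE_DIM = 4:
dim3 block(TILE_DIM, TILE_DIM);                 // 4 x 4 threads per block
dim3 grid(Width / TILE_DIM, Width / TILE_DIM);  // 4 x 4 blocks
MatrixMul<<<grid, block>>>(Md, Nd, Pd, Width);
```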

With SHARED MEMORY - Problem
__global__ void MatrixMul(int *Md, int *Nd, int *Pd, int Width) {

int bx = blockIdx.x; 
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;

__shared__ int Ad[TILE_DIM][TILE_DIM];
__shared__ int Bd[TILE_DIM][TILE_DIM];

int Row = by*TILE_DIM + ty;
int Col = bx*TILE_DIM + tx;

int PValue = 0;

for (int k = 0; k < Width / 2; k++) {
	Ad[ty][tx] = Md[Row*Width + (k*Width + tx)];
	Bd[ty][tx] = Nd[(k*Width + ty)*Width + Col];
	__syncthreads();
}

for (int k = 0; k<Width; ++k){
	PValue += Ad[tx][k] * Bd[k][ty];
	__syncthreads();
}
Pd[Row*Width+Col] = PValue;

}

With GLOBAL MEMORY - OK
__global__ void MatrixMul(int *Md, int *Nd, int *Pd, int Width) {
int bx = blockIdx.x; int by = blockIdx.y;
int tx = threadIdx.x; int ty = threadIdx.y;

int Row = by*TILE_DIM + ty;
int Col = bx*TILE_DIM + tx;

int PValue = 0;

for (int k = 0; k < Width; ++k) {
	PValue += Md[Row*Width + k] * Nd[k*Width + Col];
}
__syncthreads();

Pd[Row*Width+Col] = PValue;

}

Help me, please!

perhaps you are out of shared memory
what compute capability (cc) is the NVS 3100?

see if the kernel even runs - either with proper error checking, or by adding a breakpoint on the first kernel line, and using the debugger

or, perhaps your indices are wrong, when using shared memory
easiest seems to simply add a breakpoint to

PValue += Ad[tx][k] * Bd[k][ty];

and noting whether the value actually increments, for different threads
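(For comparison, a standard tiled multiplication interleaves the tile loads and the partial products inside a single loop over tiles, rather than splitting them into two loops as in the posted kernel. A sketch, not the poster's code, assuming Width is a multiple of TILE_DIM:)

```cuda
__global__ void MatrixMulTiled(int *Md, int *Nd, int *Pd, int Width)
{
    __shared__ int Ad[TILE_DIM][TILE_DIM];
    __shared__ int Bd[TILE_DIM][TILE_DIM];

    int Row = blockIdx.y * TILE_DIM + threadIdx.y;
    int Col = blockIdx.x * TILE_DIM + threadIdx.x;
    int PValue = 0;

    for (int m = 0; m < Width / TILE_DIM; ++m) {
        // each thread loads one element of each tile
        Ad[threadIdx.y][threadIdx.x] = Md[Row * Width + m * TILE_DIM + threadIdx.x];
        Bd[threadIdx.y][threadIdx.x] = Nd[(m * TILE_DIM + threadIdx.y) * Width + Col];
        __syncthreads();

        // partial dot product over this tile; note the index order [ty][k] * [k][tx]
        for (int k = 0; k < TILE_DIM; ++k)
            PValue += Ad[threadIdx.y][k] * Bd[k][threadIdx.x];
        __syncthreads();
    }
    Pd[Row * Width + Col] = PValue;
}
```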

The breakpoint inside the Kernel does not stop the execution … :-((

add a breakpoint after the kernel launch, in the host code

if that breakpoint is hit, and the debugger does not jump to the breakpoint in the kernel, it is safe to assume the kernel launch failed
you can then verify that via proper error checking - cudaGetLastError(), etc
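(Concretely, the check after the launch might look like this sketch; both calls are needed, since cudaGetLastError() only catches launch errors and cudaDeviceSynchronize() catches errors raised while the kernel runs:)

```cuda
MatrixMul<<<grid, block>>>(Md, Nd, Pd, Width);
cudaError_t err = cudaGetLastError();       // launch/configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();              // errors during kernel execution
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```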

As little jimmy has indicated, you ought to be doing proper cuda error checking, before asking others for help. Not sure what that is? google “proper cuda error checking” and take the first hit.

You can also try running your code as-is with cuda-memcheck.

I’m using error handling (gpuErrchk) on the allocations and the Host->Device and Device->Host transfers, and no error occurs. I think the error is in the transfer to shared memory.

The error checking on allocations and transfers is not sufficient to catch all possible issues. Perhaps you should actually look at proper cuda error checking, and implement it.

Perhaps you should provide a complete code, rather than just the kernel.

Perhaps you should run your code with cuda-memcheck.