Confused about using shared memory

Hi all, I’m writing a CUDA kernel that performs image dilation. I’m trying to use shared memory, but it’s giving me bad results.

Specifically, it returns all zeros, which should not be happening.

To debug, I tried loading each matrix I’m working with into arrays in shared memory, then copying these shared memory arrays directly to the output matrix. Sometimes this gives the correct output, and sometimes it does not.

I’ve included this in the comments in my code.

Can you guys help identify the problem? I have no idea what is causing this.

Thanks in advance!!

cudaMemset(result.d_arrayPtr,0,result.width*result.height*sizeof(*result.d_arrayPtr)); //set the output array to 0 in host code

extern __shared__ int sArray[];	//declare shared memory array dynamically

__global__ void dilateOnGPU(matrix *dest, matrix *image, matrix *kernel){

	int down, across, value, maxv, h, w, resultIndex, sampleIndex, kernelIndex, sSubIndex, a_rowOffset, b_rowOffset;


	int* sSub = (int*)&sArray;		//shared memory array for sub-sample

	int* sKernel = (int*)&sArray[256*4];	//shared memory array for kernel

	h=blockIdx.y * blockDim.y + threadIdx.y;

	w=blockIdx.x * blockDim.x + threadIdx.x;

	resultIndex = w+h*dest->width;	//for the grid

	sampleIndex = w+h*image->width;	//for the grid

	sSubIndex = threadIdx.x+threadIdx.y*blockDim.x;		//for every block

	kernelIndex = threadIdx.x+threadIdx.y*kernel->width;	//for every block


	dest->d_arrayPtr[resultIndex]=0; //store result here

	maxv = 0;


	//place a submatrix of image into each block's shared memory

	sSub[sSubIndex] = image->matrixGPUelement(sampleIndex);

	dest->d_arrayPtr[resultIndex]= sSub[sSubIndex];   //I tried just setting sSub as the output to see whether the program displays the image; it does.


	//place kernel in shared memory

	//one kernel in every block, all kernels fit in block

	if(threadIdx.y < kernel->width && threadIdx.x < kernel->width)

		sKernel[kernelIndex] = kernel->matrixGPUelement(kernelIndex);

	dest->d_arrayPtr[resultIndex]= sKernel[kernelIndex];    //I tried just setting sKernel as the output to see whether the program displays a kernel in every block; it does

	dest->d_arrayPtr[resultIndex]= sKernel[kernelIndex];    //however, if I call this line twice, the output displays all zeros!


	//perform dilation in each block. This doesn't work either; I'm getting all zeros in my output.

	//each thread in each block will perform:

	if(threadIdx.x < blockDim.x-kernel->width+1 && threadIdx.y < blockDim.y-kernel->width+1){  //problem with padding

	for(down=0;down<kernel->height;down++){			//iterate over kernel rows

			a_rowOffset = down*blockDim.x;

			b_rowOffset = down*kernel->width;

		for(across=0;across<kernel->width;across++){        //iterate over kernel columns

			value = sKernel[b_rowOffset+across] * sSub[sSubIndex + a_rowOffset+across];

			if(value>maxv) maxv = value;

		}
	}

	dest->d_arrayPtr[resultIndex]=maxv; //the output displays all zeros, which is wrong.
	}
}



So I’m guessing this is a memory allocation problem.

Each block is 16x16.
Image dimensions are 64x64.
Filter dimensions are 3x3.

So shared memory should be allocated as (16x16 + 3x3)*sizeof(int)?