Image convolution with Shared Memory

Hello,

I'm having some problems with a kernel that uses shared memory.

The principle is as follows.

On 256x256 image data, I need to apply a median filter.

The median filter takes the median (middle value) of the sorted grey values in the 3x3 neighborhood around the target pixel, leaving a one-pixel margin around the image unprocessed.
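As an illustration of that definition, here is a minimal sketch of a median over one 3x3 window. The helper name medianOfWindow is hypothetical (it is not the getMedianValue used in the kernel), and the insertion sort is just one simple way to order nine values.

```cpp
// Let the same source build with a plain C++ compiler as well as nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: sorts the n grey values in place with an insertion
// sort and returns the middle element. n is assumed to be odd (9 for 3x3).
__host__ __device__ inline unsigned char medianOfWindow(unsigned char *v, int n)
{
    for (int i = 1; i < n; ++i) {
        unsigned char key = v[i];
        int j = i - 1;
        // Shift larger values one slot to the right.
        while (j >= 0 && v[j] > key) {
            v[j + 1] = v[j];
            --j;
        }
        v[j + 1] = key;
    }
    return v[n / 2];   // middle of the sorted series
}
```

For the nine values 12, 200, 7, 45, 45, 3, 255, 90, 18 the sorted series is 3, 7, 12, 18, 45, 45, 90, 200, 255, so the function returns 45.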

My kernel grid looks like this: GridDim 256, BlockDim 256, so one block corresponds to one data row.
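The index arithmetic that goes with this layout can be sketched as follows. It assumes, as in the kernel further down, that (blockIdx.x, threadIdx.x) is the upper-left corner of the 3x3 window; the helper names windowInside and centreIndex are hypothetical.

```cpp
// Let the same source build with a plain C++ compiler as well as nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: true if the 3x3 window whose upper-left corner is at
// (row, col) lies completely inside a width x height image.
__host__ __device__ inline bool windowInside(int row, int col, int width, int height)
{
    return row >= 0 && col >= 0 && row + 2 < height && col + 2 < width;
}

// Linear index of the window centre, i.e. the pixel that receives the median.
__host__ __device__ inline int centreIndex(int row, int col, int width)
{
    return (row + 1) * width + (col + 1);
}
```

With 256 blocks of 256 threads on a 256x256 image, only rows 0 to 253 and columns 0 to 253 pass this test, so the last two blocks and the last two threads of every block would need a guard of this kind (or a smaller launch) to avoid reading past the image.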

This has worked so far, but to speed things up I did the following.

Create a shared memory array with 3 rows (the processing mask) + 1 row for the results of the calculation on those 3 rows.

All calculations are done in shared memory.

When the last thread of the block is reached, the one result row in shared memory should be written back to global memory.
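For comparison, a minimal sketch of one common way to stage three rows in shared memory is shown here. It is not the code from this post: the names medianRowKernel and median9 are hypothetical, every thread loads its own pixels instead of one thread filling the whole mask, the window is centred on threadIdx.x, and each thread writes its result straight to global memory, so the fourth result row is left out. It assumes the block size equals the image width.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: insertion sort of 9 values, returns the middle one.
__device__ unsigned char median9(unsigned char *v)
{
    for (int i = 1; i < 9; ++i) {
        unsigned char key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; --j; }
        v[j + 1] = key;
    }
    return v[4];
}

// One block per output row. Assumes blockDim.x == width and a launch with
// (height - 2) blocks and 3 * width bytes of dynamic shared memory.
__global__ void medianRowKernel(const unsigned char *in, unsigned char *out, int width)
{
    extern __shared__ unsigned char rows[];   // 3 input rows, back to back

    const int x   = threadIdx.x;              // column handled by this thread
    const int top = blockIdx.x;               // first of the three input rows

    // Cooperative load: every thread copies one pixel of each row.
    for (int r = 0; r < 3; ++r)
        rows[r * width + x] = in[(top + r) * width + x];

    // Reached by all threads of the block, so the barrier is safe here.
    __syncthreads();

    // Border columns have no complete 3x3 neighborhood.
    if (x == 0 || x == width - 1)
        return;

    unsigned char v[9];
    int k = 0;
    for (int r = 0; r < 3; ++r)
        for (int c = -1; c <= 1; ++c)
            v[k++] = rows[r * width + x + c];

    out[(top + 1) * width + x] = median9(v);
}

int main()
{
    const int width = 256, height = 256;
    std::vector<unsigned char> h_in(width * height), h_out(width * height, 0);
    for (int i = 0; i < width * height; ++i)
        h_in[i] = static_cast<unsigned char>((i * 37) % 256);  // arbitrary test pattern

    unsigned char *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size());
    cudaMalloc(&d_out, h_out.size());
    cudaMemcpy(d_in, h_in.data(), h_in.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, h_out.size());

    // Third launch parameter: bytes of dynamic shared memory per block.
    medianRowKernel<<<height - 2, width, 3 * width>>>(d_in, d_out, width);
    cudaError_t err = cudaGetLastError();                 // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize(); // execution errors
    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out.data(), d_out, h_out.size(), cudaMemcpyDeviceToHost);
    std::printf("out[1][1] = %d\n", h_out[width + 1]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Two details of the sketch matter for any variant of this scheme: an extern __shared__ array only has storage if the launch passes a size as the third <<<...>>> parameter, and __syncthreads() has to be reached by every thread of the block.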

The source code looks like this:

[codebox]int idx;

extern __shared__ unsigned char sData[];  // Placeholder for 4 rows -> 3 mask, 1 row calculated
unsigned char *MaskData = (unsigned char *)sData;
unsigned char *RowData = (unsigned char *)&MaskData[3 * Width];
//unsigned char *NeighborhoodValues = (unsigned char *)&RowData[Width];

idx = (blockIdx.x + 1) * Width + 1 + threadIdx.x;  // center pixel

if (threadIdx.x == 0)  // Fill shared data sMask
	fillMaskData(data_d_orig, MaskData, 3, Width);

// Get the neighborhood
int pix00 = blockIdx.x * Width + threadIdx.x;            // upper left
int pix10 = blockIdx.x * Width + 1 + threadIdx.x;        // upper middle
int pix20 = blockIdx.x * Width + 2 + threadIdx.x;        // upper right
int pix01 = (blockIdx.x + 1) * Width + threadIdx.x;      // left
int pix21 = (blockIdx.x + 1) * Width + 2 + threadIdx.x;  // right
int pix02 = (blockIdx.x + 2) * Width + threadIdx.x;      // lower left
int pix12 = (blockIdx.x + 2) * Width + 1 + threadIdx.x;  // lower middle
int pix22 = (blockIdx.x + 2) * Width + 2 + threadIdx.x;  // lower right

unsigned char values[9];  // the neighborhood
values[8] = data_d_orig[pix00];
values[7] = data_d_orig[pix10];
values[6] = data_d_orig[pix20];
values[5] = data_d_orig[pix01];
values[4] = data_d_orig[pix21];
values[3] = data_d_orig[idx];
values[2] = data_d_orig[pix02];
values[1] = data_d_orig[pix12];
values[0] = data_d_orig[pix22];

int length = sizeof(values) / sizeof(unsigned char);
unsigned char MedianValue = getMedianValue(values, length);
RowData[threadIdx.x] = MedianValue;

__syncthreads();

if (threadIdx.x == Width - 2)  // write data back to global memory
	fillwithRowData(data_d, RowData, 1, Width);[/codebox]

The method fillwithRowData():

[codebox]__device__ void fillwithRowData(unsigned char *datatowrite, unsigned char *sDataMask, int offsetrows, int Width){
	// Copy one row from shared memory to row (blockIdx.x + offsetrows) of the global array
	int start = (blockIdx.x + offsetrows) * Width;
	for (int i = 0; i < Width; i++){
		datatowrite[start + i] = sDataMask[i];
		__syncthreads();
	}
}[/codebox]

The method fillMaskData() (fills the 3 x row-length array):

[codebox]__device__ void fillMaskData(unsigned char *origData, unsigned char *sDataMask, int rows, int Width){
	// Copy `rows` consecutive rows, starting at row blockIdx.x, from global to shared memory
	int start = blockIdx.x * Width;
	int end = start + rows * Width;
	for (int i = start, j = 0; i < end; i++, j++){
		sDataMask[j] = origData[i];
		__syncthreads();
	}
}[/codebox]

The result is that the image comes back unchanged, as if no operation had been done at all.

I have a similar function that takes the data directly from the global memory array, just like this, and it works fine. Obviously something is wrong with my shared memory calculation, but what?

Maybe you can help?

Thanks, best regards,

maz