Hello,
I'm having some problems with a kernel that uses shared memory.
The principle should be like this:
On 256x256 image data, I need to apply a median filter.
The median filter takes the median (middle value) of the sorted grey values in the 3x3 neighborhood around the target pixel, leaving a margin of one pixel untouched at the image border.
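To make the median step concrete, here is a minimal sketch of what I mean. It is not my actual getMedianValue() (that one takes a length argument and is not shown in this post); the name median9 is made up for illustration, and the insertion sort is just one simple way to get the middle of nine values. The preprocessor guard at the top is only an assumption on my side so the snippet also builds with a plain C++ compiler.

```cpp
// Guard: lets the same source build under nvcc and as plain C++ on the host.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical stand-in for getMedianValue(): copies the nine grey values
// of a 3x3 neighborhood, sorts the copy with insertion sort, and returns
// the middle (5th) value.
__host__ __device__ inline unsigned char median9(const unsigned char *values)
{
    unsigned char v[9];
    for (int i = 0; i < 9; ++i)
        v[i] = values[i];          // work on a private copy

    for (int i = 1; i < 9; ++i) {
        unsigned char key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) {
            v[j + 1] = v[j];       // shift larger values one slot up
            --j;
        }
        v[j + 1] = key;
    }
    return v[4];                   // middle element of the sorted copy
}
```

For the values 9, 1, 8, 2, 7, 3, 6, 4, 5 the sorted series is 1 to 9, so the function returns 5.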
My kernel launch looks like this: GridDim 256, BlockDim 256, so one block corresponds to one data row.
This has worked so far, but to speed things up I did the following:
Create a shared memory array with 3 rows (the processing mask) + 1 row for the results of the calculation on those 3 rows.
All calculations are done in shared memory.
When the last thread of the block is reached, the result row in shared memory should be written back to global memory.
The source code looks like this:
[codebox]int idx;
extern __shared__ unsigned char sData[]; // Placeholder for 4 rows -> 3 mask, 1 row calculated
unsigned char * MaskData = (unsigned char *)sData;
unsigned char * RowData =(unsigned char *)&MaskData[3*Width];
//unsigned char * NeighborhoodValues =(unsigned char *)&RowData[Width];
idx = (blockIdx.x + 1) * Width + 1 + threadIdx.x;
if(threadIdx.x == 0)// Fill shared Data sMask
fillMaskData(data_d_orig,MaskData,3,Width);
//Get the Neighborhood
int pix00 = blockIdx.x * Width + threadIdx.x; // Upper left
int pix10 = blockIdx.x * Width + 1 + threadIdx.x; // Upper middle
int pix20 = blockIdx.x * Width + 2 + threadIdx.x; // Upper right
int pix01 = (blockIdx.x + 1) * Width + threadIdx.x; // left
int pix21 = (blockIdx.x + 1) * Width + 2 + threadIdx.x; // right
int pix02 = (blockIdx.x + 2) * Width + threadIdx.x; // lower left
int pix12 = (blockIdx.x + 2) * Width + 1 + threadIdx.x; // lower middle
int pix22 = (blockIdx.x + 2) * Width + 2 + threadIdx.x;// lower Right
unsigned char values[9]; // the neighborhood
values[8]=data_d_orig[pix00];
values[7]=data_d_orig[pix10];
values[6]=data_d_orig[pix20];
values[5]=data_d_orig[pix01];
values[4]=data_d_orig[pix21];
values[3]=data_d_orig[idx];
values[2]=data_d_orig[pix02];
values[1]=data_d_orig[pix12];
values[0]=data_d_orig[pix22];
int length = sizeof(values)/sizeof(unsigned char);
unsigned char MedianValue = getMedianValue(values,length);
RowData[threadIdx.x] = MedianValue;
__syncthreads();
if(threadIdx.x == Width -2) // write data back to global memory
fillwithRowData(data_d, RowData,1,Width);[/codebox]
The helper fillwithRowData():
[codebox]__device__ void fillwithRowData(unsigned char *datatowrite, unsigned char *sDataMask, int offsetrows, int Width){
int start = (blockIdx.x+offsetrows)*Width;
for(int i = 0; i < Width; i++){
datatowrite[start+i] = sDataMask[i];
__syncthreads();
}
}[/codebox]
The helper fillMaskData() (fills the 3 x row-length array):
[codebox]__device__ void fillMaskData(unsigned char *origData, unsigned char *sDataMask, int rows, int Width){
int start = blockIdx.x*Width;
int end = start + rows*Width;
for(int i = start,j=0; i < end; i++,j++){
sDataMask[j] = origData[i];
__syncthreads();
}
}[/codebox]
The result is that the image comes out unchanged, as if no operation had been done at all.
I have a similar function that reads the data directly from the global memory array, just like this, and it works fine. Obviously something is wrong with my shared memory calculation, but what?
Maybe you can help?
Thanks, best regards
maz