Hi everyone,

Once again I'm having trouble locating the cause of a difference between emulation mode and execution on the device.

This time I'm trying to write a naive median filter based on a sort algorithm (the kernel size is 3*3, so the radius is 1).

In emulation mode I get exactly the result I expect, but when I switch to execution on the device I get very strange results that depend on the block size…

The problem definitely comes from the function addtri.

I think it originates from the fact that I'm using an intermediate array (float * tab) to sort the values in the 1-pixel apron around the pixel whose median I'm trying to compute. The same array is shared by all the threads, whereas what I want is for each thread to use its own array.
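To make the "one array per thread" idea concrete, here is a minimal sketch rather than a definitive fix. It assumes the radius never exceeds a compile-time bound, so the window can live in a fixed-size array declared inside the function, which gives every thread a private copy. The names `MAX_RADIUS`, `MAX_WINDOW`, `read_pitched` and `median_at` are hypothetical and do not appear in the posted code. The qualifier macros are only there so that the same source also builds with a plain host compiler.

```cpp
#include <cstddef>

// Let the same source build with nvcc or with a plain host C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Assumption: the radius never exceeds MAX_RADIUS, so the window fits in a
// fixed-size array that is private to each thread.
#define MAX_RADIUS 1
#define MAX_WINDOW ((2 * MAX_RADIUS + 1) * (2 * MAX_RADIUS + 1))

// Read element (x, y) of a pitched 2D float array (pitch is in bytes).
__host__ __device__ inline float read_pitched(const float *base, size_t pitch,
                                              int x, int y)
{
    return *((const float *)((const char *)base + (size_t)y * pitch) + x);
}

// Hypothetical helper: median of the neighbourhood of pixel (idx, idy).
// Neighbours outside the image are skipped, as in the posted kernel.
__host__ __device__ inline float median_at(const float *img, size_t pitch,
                                           int nbcol, int nbrow,
                                           int idx, int idy, int radius)
{
    if (radius > MAX_RADIUS)
        radius = MAX_RADIUS; // keep the window inside the local array

    float window[MAX_WINDOW]; // one copy per thread, never shared
    int count = 0;
    for (int j = -radius; j <= radius; j++) {
        for (int i = -radius; i <= radius; i++) {
            int jj = idy + j, ii = idx + i;
            if (jj < 0 || jj >= nbrow || ii < 0 || ii >= nbcol)
                continue;
            float val = read_pitched(img, pitch, ii, jj);
            // Sorted insertion: shift larger values up, then store val.
            int k = count;
            while (k > 0 && window[k - 1] > val) {
                window[k] = window[k - 1];
                k--;
            }
            window[k] = val;
            count++;
        }
    }
    return window[count / 2]; // same index rule as the posted kernel
}
```

With a helper like this, the body of the kernel would reduce to a bounds check followed by `GDATA(d_odata,idx,idy,pitch_odata) = median_at(d_idata, pitch_idata, nbcol, nbrow, idx, idy, radius);`, and the `tab` parameter would no longer be needed.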

I've never been confronted with this kind of situation, and I can't think of a way to avoid the problem. Are atomic operations the solution, or am I totally mistaken?

I've pasted the code below in case you want to have a look.

```
// Macro used to compute a position in a pitched 2D array (pitch is in bytes)
#define GDATA(base,iDx,iDy,pitch) (*((float*)((char*)(base) + (iDy) * (pitch)) + (iDx)))
/********************************************************************************/
/* Function Name : addtri */
/* Description : Add a new value, keeping the array sorted (increasing order) */
/********************************************************************************/
/* Parameters */
/* INPUT : */
/* - float * tab, sorted array from which we compute the median value */
/* - int real_size, number of values already stored in tab */
/* - float val, value to insert */
/* OUTPUT : */
/* - float * tab, sorted with increasing values */
/********************************************************************************/
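__device__ void
addtri(float *tab, int real_size, float val)
{
    int i = 0;
    float temp;
    if (real_size == 0) {
        tab[0] = val;
    } else {
        // Walk the sorted values: at each step the smaller value stays in
        // tab[i] and the larger one is carried forward in val.
        while (i < real_size) {
            if (val < tab[i]) {
                temp = tab[i];
                tab[i] = val;
                val = temp;
            }
            i++;
        }
        tab[i] = val; // the largest value ends up in the new last slot
    }
}
// end of addtri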
/********************************************************************************/
/* Function Name : Kernel_MF */
/* Description : Adaptation of the Median filtering to CUDA / Kernel */
/********************************************************************************/
/* Parameters */
/* INPUT : */
/* - d_idata, input image */
/* - pitch_idata, pitch_odata : pitch in bytes of the input and output images */
/* - tab , intermediate storage array (the same pointer for every thread) */
/* - nbcol, nbrow : size of the image */
/* - radius : radius of the median filter kernel */
/* OUTPUT : */
/* - d_odata, output image */
/********************************************************************************/
__global__ void
Kernel_MF(float * d_idata, size_t pitch_idata,
          float * d_odata, size_t pitch_odata,
          float * tab,
          int nbcol, int nbrow,
          int radius)
{
    int real_size, i, j, ii, jj, index;
    float val;
    float median;
    const int idx = blockDim.x*blockIdx.x + threadIdx.x; // Pixel indices
    const int idy = blockDim.y*blockIdx.y + threadIdx.y;
    if ((idx <= nbcol-1) && (idy <= nbrow-1)) {
        real_size = 0;
        for (j = -radius; j <= radius; j++) {
            for (i = -radius; i <= radius; i++) {
                jj = idy+j; ii = idx+i;
                if ((jj >= 0) && (jj < nbrow) && (ii >= 0) && (ii < nbcol))
                {
                    val = GDATA(d_idata, ii, jj, pitch_idata);
                    addtri(tab, real_size, val);
                    real_size++;
                }
            } // for i
        } // for j
        index = real_size/2;
        median = tab[index];
        GDATA(d_odata, idx, idy, pitch_odata) = median;
    }
}
```
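On the atomic question: an atomic operation makes a single read-modify-write on one memory word indivisible, so it would probably not help here, where each thread needs a whole array left untouched across several insertions. If `tab` has to stay in global memory, one possible alternative is to give every pixel its own slice of it. The sketch below assumes `tab` is allocated with `nbcol * nbrow * window_len` floats; `thread_scratch` and `window_len` are hypothetical names that do not appear in the code above.

```cpp
#include <cstddef>

// Let the same source build with nvcc or with a plain host C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: start of the scratch slice owned by pixel (idx, idy).
// Assumes tab holds nbcol * nbrow * window_len floats.
__host__ __device__ inline float *thread_scratch(float *tab, int nbcol,
                                                 int idx, int idy,
                                                 int window_len)
{
    size_t pixel = (size_t)idy * nbcol + idx; // unique linear index per pixel
    return tab + pixel * window_len;          // slices never overlap
}
```

Inside the kernel this would be used as `float *my_tab = thread_scratch(tab, nbcol, idx, idy, (2*radius+1)*(2*radius+1));`, with `my_tab` replacing `tab` in the call to `addtri` and in the final `tab[index]` read. The cost is a scratch buffer `window_len` times the size of the image and extra global memory traffic, so for a 3*3 window a small array declared inside the kernel is likely the simpler choice.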