Naive Median Filter Implementation: Difference between Emulation and Execution

Hi everyone,

Once again I'm having trouble localizing an issue that causes a difference between emulation and execution mode.

This time I'm trying to write a naive implementation of the median filter using a sorting algorithm (the kernel size is 3*3, so the radius is 1).

In emulation mode I get exactly the expected result, but when I switch to execution I get very strange results that depend on the block size…

It definitely comes from the function addtri.

I think it originates from the fact that I'm using an intermediate array (float *tab) to sort the values contained in the 1-pixel apron around the pixel whose median I'm trying to compute. The same array is used by all the threads… whereas what I would like is for each thread to use its own array.

I've never been confronted with this type of situation and can't think of another way to avoid the problem. Are atomic operations the solution, or am I totally mistaken?

I'll copy-paste the code here in case you want to take a look.

#define GDATA(base,iDx,iDy,pitch) (*((float*)((char*)(base) + (iDy) * (pitch)) + (iDx)))   // Macro used to compute a position in a pitched 2D array

/************************************************************************
 * Function name : addtri
 * Description   : insert a new value and keep the array sorted
 *                 (increasing order)
 *
 * Parameters
 *  INPUT  : - float *tab    : array from which the median value is computed
 *           - int real_size : number of values already stored in the array
 *           - float val     : new value to insert
 *  OUTPUT : - float *tab    : sorted in increasing order
 ************************************************************************/

__device__ void
addtri(float *tab, int real_size, float val)
{
    int i = 0;
    float temp;

    if (real_size == 0) {
        tab[0] = val;
    } else {
        while (i < real_size) {
            if (val < tab[i]) {
                temp   = tab[i];
                tab[i] = val;
                val    = temp;
            }
            i++;
        }
        tab[i] = val;
    }
}
// end of addtri

/************************************************************************
 * Function name : Kernel_MF
 * Description   : adaptation of the median filtering to CUDA (kernel)
 *
 * Parameters
 *  INPUT  : - d_idata      : input image
 *           - tab          : intermediate storage array
 *           - nbcol, nbrow : size of the image
 *           - radius       : radius of the median filter kernel
 *  OUTPUT : - d_odata      : output image
 ************************************************************************/

__global__ void
Kernel_MF(float *d_idata, size_t pitch_idata,
          float *d_odata, size_t pitch_odata,
          float *tab,
          int nbcol, int nbrow,
          int radius)
{
    int real_size, i, j, ii, jj, index;
    float val;
    float median;
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;   // Pixel indices
    const int idy = blockDim.y * blockIdx.y + threadIdx.y;

    if ((idx <= nbcol - 1) && (idy <= nbrow - 1)) {
        real_size = 0;
        for (j = -radius; j <= radius; j++) {
            for (i = -radius; i <= radius; i++) {
                jj = idy + j; ii = idx + i;
                if ((jj >= 0) && (jj < nbrow) && (ii >= 0) && (ii < nbcol)) {
                    val = GDATA(d_idata, ii, jj, pitch_idata);
                    addtri(tab, real_size, val);   // note: tab is the same pointer for every thread
                    real_size++;
                }
            }
        }
        index  = real_size / 2;
        median = tab[index];
        GDATA(d_odata, idx, idy, pitch_odata) = median;
    }
}
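For reference, this is essentially what the emulation-mode result corresponds to per pixel (a plain C sketch of the CPU version; the names are illustrative, not my actual host code): the valid 3*3 neighbours are inserted into a private 9-element window using the same swap-through insertion as addtri, and the middle element of the sorted window is written out.

/* Illustrative CPU reference for a 3*3 (radius 1) median filter. */
void median_filter_cpu(const float *in, float *out, int nbcol, int nbrow, int radius)
{
    for (int y = 0; y < nbrow; y++) {
        for (int x = 0; x < nbcol; x++) {
            float window[9];                    /* enough for radius 1 */
            int real_size = 0;
            for (int j = -radius; j <= radius; j++) {
                for (int i = -radius; i <= radius; i++) {
                    int jj = y + j, ii = x + i;
                    if (jj >= 0 && jj < nbrow && ii >= 0 && ii < nbcol) {
                        float val = in[jj * nbcol + ii];
                        /* same swap-through insertion as addtri */
                        for (int k = 0; k < real_size; k++) {
                            if (val < window[k]) {
                                float tmp = window[k];
                                window[k] = val;
                                val = tmp;
                            }
                        }
                        window[real_size++] = val;
                    }
                }
            }
            out[y * nbcol + x] = window[real_size / 2];
        }
    }
}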

I might be misinterpreting something here, but don't you have a huge data inconsistency problem with tab? It looks like every thread you launch is going to do that in-place sort on tab simultaneously. The only way that will work as written is if you can guarantee that the memory accesses within the sort are serialized (which they implicitly are in emulation). On the device that is going to be very hard to achieve with the algorithm as it stands.

Yes, that's exactly what I was suspecting, and I don't know exactly how to solve it. Serializing the accesses would slow the algorithm down considerably (it would end up about as slow as the CPU version).

I thought of another solution. Let's say I want to filter an image of size (nbcol, nbrow).

Instead of having a single array of size 9*9*sizeof(float), I could use a 3D tab of size 9*9*sizeof(float)*nbcol*nbrow in global memory, which would allow me to store every sub-array used to find the median for each pixel of the image (254,803.968 kBytes).
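Concretely, each thread would then index its own slice of that big scratch buffer, something like this (a rough sketch; I assume here a slice of 9 floats per pixel, which is what a 3*3 window needs, and a buffer allocated with a plain cudaMalloc, laid out pixel-major):

// Rough sketch of the global-memory variant: tab is assumed to be a
// cudaMalloc'd buffer of nbcol * nbrow * 9 floats, so every thread
// owns a disjoint 9-float slice of it.
__global__ void
Kernel_MF_global(float *d_idata, size_t pitch_idata,
                 float *d_odata, size_t pitch_odata,
                 float *tab, int nbcol, int nbrow, int radius)
{
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    const int idy = blockDim.y * blockIdx.y + threadIdx.y;
    if (idx >= nbcol || idy >= nbrow)
        return;

    float *my_tab = tab + (idy * nbcol + idx) * 9;   // this thread's private slice

    int real_size = 0;
    for (int j = -radius; j <= radius; j++) {
        for (int i = -radius; i <= radius; i++) {
            int jj = idy + j, ii = idx + i;
            if (jj >= 0 && jj < nbrow && ii >= 0 && ii < nbcol) {
                addtri(my_tab, real_size, GDATA(d_idata, ii, jj, pitch_idata));
                real_size++;
            }
        }
    }
    GDATA(d_odata, idx, idy, pitch_odata) = my_tab[real_size / 2];
}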

A softer version is also conceivable, using shared memory.
If the block size is 16*16, for example, I would need two arrays in shared memory:

  • a first tab of size (16+2)*(16+2)*sizeof(float) (the +2 is there to load the apron, since it is a 3*3 median filter; its content would be loaded the same way as in the separable 2D convolution example explained in the SDK)
  • a second tab of size 16*16*9*sizeof(float) to store the neighbours needed to compute each median. That way I wouldn't need to serialize the accesses, since there is enough storage for the threads not to get in each other's way (a rough sketch follows below).
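Here is a rough sketch of what I have in mind (assuming a 16*16 block and radius 1; for brevity I leave out the (16+2)*(16+2) apron tile and read the neighbours straight from global memory, the important part being that every thread gets its own 9-float slice of shared memory):

#define BLOCK_X 16
#define BLOCK_Y 16

// Rough sketch of the shared-memory variant: one private 9-float window per
// thread (16*16*9*sizeof(float) = 9 kB per block), so no serialization is
// needed. The SDK-style apron tile loading is omitted here.
__global__ void
Kernel_MF_shared(float *d_idata, size_t pitch_idata,
                 float *d_odata, size_t pitch_odata,
                 int nbcol, int nbrow, int radius)
{
    __shared__ float windows[BLOCK_X * BLOCK_Y * 9];
    float *my_tab = windows + (threadIdx.y * BLOCK_X + threadIdx.x) * 9;

    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    const int idy = blockDim.y * blockIdx.y + threadIdx.y;
    if (idx >= nbcol || idy >= nbrow)
        return;

    int real_size = 0;
    for (int j = -radius; j <= radius; j++) {
        for (int i = -radius; i <= radius; i++) {
            int jj = idy + j, ii = idx + i;
            if (jj >= 0 && jj < nbrow && ii >= 0 && ii < nbcol) {
                addtri(my_tab, real_size, GDATA(d_idata, ii, jj, pitch_idata));
                real_size++;
            }
        }
    }
    GDATA(d_odata, idx, idy, pitch_odata) = my_tab[real_size / 2];
}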

I'm going to try this solution. But if anyone has a smarter version in mind, I'm all ears!

In brief, for those who might be interested later:
The version with global memory was using too much global memory and wouldn't launch; on the other hand, the version using shared memory and a 3D tab of size block_x*block_y*9 worked perfectly fine and accelerated the process by a factor of 5.90.
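For completeness, the launch configuration for the shared-memory version looks roughly like this (variable names are illustrative; note that the global tab argument is gone, since every thread now uses its own slice of shared memory):

// Illustrative launch with 16*16 blocks; the grid is rounded up so that the
// whole nbcol x nbrow image is covered.
dim3 block(16, 16);
dim3 grid((nbcol + block.x - 1) / block.x,
          (nbrow + block.y - 1) / block.y);
Kernel_MF_shared<<<grid, block>>>(d_idata, pitch_idata,
                                  d_odata, pitch_odata,
                                  nbcol, nbrow, 1);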