Optimize bilateral filter

dorstrem · February 8, 2010, 8:10am

Hi all!

I’m using this bilateral filter in my project, but the function is not optimized at all.

Can someone help me optimize it?

Shared memory or something similar might help, but I don’t understand how to use it.

Thanks!

/********************************************************/
/*  Name: kernelBF.cu                                   */
/*  Last date of modification: 20/04/09                 */
#ifndef _KERNEL_H_
#define _KERNEL_H_
// Access element (iDx, iDy) of a pitched 2D float array allocated with cudaMallocPitch
#define GDATA(base,iDx,iDy,pitch) (*((float*)((char*)(base) + (iDy) * (pitch)) + (iDx)))

__global__ void
KernelBF(float* d_idata, size_t pitch_in,
		 U16 * d_RGBmap,  // U16 == (unsigned short int)
		 float* d_odata, size_t pitch_out,
		 unsigned int width, unsigned int height,
		 float * d_skernel)
/* Kernel parameters */
/* d_idata   : disparity map before filtering; pitch_in: pitch created with cudaMallocPitch */
/* d_odata   : disparity map after filtering; pitch_out: pitch created with cudaMallocPitch */
/* d_RGBmap  : color map (R,G,B), size 2*width*height; pitch_color: pitch created with cudaMallocPitch */
/* width     : image width (number of columns); height: image height (number of rows) */
/* d_skernel : kernel computed on the host side containing the coefficients for spatial filtering */

{
	/********* Variables used in BF computation ********/
	float sig_color = 200.0f; // value distance sigma
	float sig_disp = 200.0f; // disparity distance sigma
	//	float sig_spatial = 200.0f;

	float d_disp=0.0f,d_color=0.0f;
	float w_spatial=1.0f, w_color=1.0f,w_disp=1.0f,w_final=1.0f;
	// the spatial distance sigma is defined in BF.cu

	// Thread Index
	const int tix=threadIdx.x;
	const int tiy=threadIdx.y;
	// Global Index
	int idx=blockIdx.x*blockDim.x+tix;
	int idy=blockIdx.y*blockDim.y+tiy;
	if ((idx <width)&&(idy<height)){
		float sum=0.0f;
		float w_t=0.0f;
		for (int i=-KERNEL_RADIUS;i<=KERNEL_RADIUS;i++){
			for (int j=-KERNEL_RADIUS;j<=KERNEL_RADIUS;j++){
				if ( ((idx+i)>=0)&&((idy+j)>=0)&&((idx+i)<width)&&((idy+j)<height)){
					w_spatial=d_skernel[KERNEL_RADIUS-i+(KERNEL_W)*(KERNEL_RADIUS-j)];
					d_color=(float) (	abs(d_RGBmap[(idx+width*idy)*3]-d_RGBmap[(idx+i+width*(idy+j))*3])+
										abs(d_RGBmap[(idx+width*idy)*3+1]-d_RGBmap[(idx+i+width*(idy+j))*3+1])+
										abs(d_RGBmap[(idx+width*idy)*3+2]-d_RGBmap[(idx+i+width*(idy+j))*3+2]));
					d_color*=d_color;
					w_color=exp(-d_color/sig_color);
					d_disp=(GDATA(d_idata,idx+i,idy+j,pitch_in)-GDATA(d_idata,idx,idy,pitch_in));
					d_disp*=d_disp;
					w_disp=exp(-d_disp/sig_disp);
					w_final=w_spatial*w_color*w_disp;
					w_t+=w_final;
					sum+=w_final*GDATA(d_idata,idx+i,idy+j,pitch_in);
				}
			}// for j
		}// for i
		if (w_t !=0 ){
			GDATA(d_odata,idx,idy,pitch_out)= (float)(sum/w_t);
		}else{ GDATA(d_odata,idx,idy,pitch_out)=0.0f;}
	}// if
}

#endif // #ifndef _KERNEL_H_

dorstrem · February 8, 2010, 1:15pm

UP !

eyalhir74 · February 8, 2010, 1:27pm

Hi,

It seems the first place to start would be to reduce the number of global memory reads you’re doing.

For example, the following code:

for (int i=-KERNEL_RADIUS;i<=KERNEL_RADIUS;i++)
{
   for (int j=-KERNEL_RADIUS;j<=KERNEL_RADIUS;j++)
   {
	  w_spatial=d_skernel[KERNEL_RADIUS-i+(KERNEL_W)*(KERNEL_RADIUS-j)];
	  ....
   }
}

It would probably be much better to read the data from d_skernel into shared memory once and have all threads, across every iteration of i and j, use the copy in shared memory rather than reading the same data from global memory over and over again.

If KERNEL_RADIUS and KERNEL_W are very big, you can split the shared memory loads into chunks or use other strategies…

The same goes for this:

d_color=(float) (	abs(d_RGBmap[(idx+width*idy)*3]-d_RGBmap[(idx+i+width*(idy+j))*3])+
		abs(d_RGBmap[(idx+width*idy)*3+1]-d_RGBmap[(idx+i+width*(idy+j))*3+1])+
		abs(d_RGBmap[(idx+width*idy)*3+2]-d_RGBmap[(idx+i+width*(idy+j))*3+2]));

Try to stage this in shared memory as well. These reads are not coalesced at all: neighbouring threads read U16 values that sit three elements apart.

I guess most of the time is being wasted on uncoalesced and unoptimized memory access patterns…
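For what it is worth, here is a minimal sketch of what staging the neighbourhood in shared memory could look like. It is not the original poster's code: the values of KERNEL_RADIUS, BLOCK_W and BLOCK_H, the name KernelBFTiled and the switch from exp to the single-precision expf are all assumptions made for illustration. Each block first copies its tile plus a KERNEL_RADIUS-wide border of disparity and RGB values into shared memory, with consecutive threads reading consecutive elements, and only then runs the double loop on the shared copies.

```cuda
#include <cuda_runtime.h>

typedef unsigned short U16;

#define KERNEL_RADIUS 4                        // assumed value, not given in the thread
#define KERNEL_W (2 * KERNEL_RADIUS + 1)       // assumed relation to KERNEL_RADIUS
#define BLOCK_W 16                             // assumed: launch with dim3(BLOCK_W, BLOCK_H)
#define BLOCK_H 16
#define TILE_W (BLOCK_W + 2 * KERNEL_RADIUS)   // block plus left/right border
#define TILE_H (BLOCK_H + 2 * KERNEL_RADIUS)   // block plus top/bottom border

// Hypothetical tiled variant of KernelBF; d_RGBmap is assumed packed as in the original indexing.
__global__ void KernelBFTiled(const float* d_idata, size_t pitch_in,
                              const U16* d_RGBmap,
                              float* d_odata, size_t pitch_out,
                              int width, int height,
                              const float* d_skernel)
{
    __shared__ float smDisp[TILE_H][TILE_W];
    __shared__ U16   smRGB[TILE_H][TILE_W * 3];

    const int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    const int nThreads = blockDim.x * blockDim.y;
    const int x0 = blockIdx.x * BLOCK_W - KERNEL_RADIUS;  // image x of tile column 0
    const int y0 = blockIdx.y * BLOCK_H - KERNEL_RADIUS;  // image y of tile row 0

    // Disparity tile: consecutive threads read consecutive floats of a row.
    for (int t = tid; t < TILE_W * TILE_H; t += nThreads) {
        const int ty = t / TILE_W, tx = t % TILE_W;
        const int gx = x0 + tx, gy = y0 + ty;
        float v = 0.0f;                                    // padding outside the image
        if (gx >= 0 && gx < width && gy >= 0 && gy < height)
            v = *((const float*)((const char*)d_idata + gy * pitch_in) + gx);
        smDisp[ty][tx] = v;
    }
    // RGB tile: each tile row is a contiguous run of 3*TILE_W U16 values.
    for (int t = tid; t < TILE_W * 3 * TILE_H; t += nThreads) {
        const int ty = t / (TILE_W * 3), e = t % (TILE_W * 3);
        const int gy = y0 + ty, ex = x0 * 3 + e;           // ex == 3*gx + channel
        U16 v = 0;
        if (ex >= 0 && ex < width * 3 && gy >= 0 && gy < height)
            v = d_RGBmap[3 * width * gy + ex];
        smRGB[ty][e] = v;
    }
    __syncthreads();  // every thread reaches this, including those outside the image

    const int idx = blockIdx.x * BLOCK_W + threadIdx.x;
    const int idy = blockIdx.y * BLOCK_H + threadIdx.y;
    if (idx >= width || idy >= height) return;

    const float sig_color = 200.0f, sig_disp = 200.0f;    // same sigmas as the original
    const int cx = threadIdx.x + KERNEL_RADIUS;           // centre pixel in tile coordinates
    const int cy = threadIdx.y + KERNEL_RADIUS;
    const float dispC = smDisp[cy][cx];
    float sum = 0.0f, w_t = 0.0f;

    for (int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++) {
        for (int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; j++) {
            if (idx + i < 0 || idy + j < 0 || idx + i >= width || idy + j >= height)
                continue;                                  // same border rule as the original
            // Spatial weights stay in global memory to keep this sketch focused on the tiles.
            const float w_spatial = d_skernel[KERNEL_RADIUS - i + KERNEL_W * (KERNEL_RADIUS - j)];
            float d_color = 0.0f;
            for (int c = 0; c < 3; c++)
                d_color += fabsf((float)smRGB[cy][cx * 3 + c] -
                                 (float)smRGB[cy + j][(cx + i) * 3 + c]);
            d_color *= d_color;
            const float dispN  = smDisp[cy + j][cx + i];
            const float d_disp = (dispN - dispC) * (dispN - dispC);
            const float w = w_spatial * expf(-d_color / sig_color) * expf(-d_disp / sig_disp);
            w_t += w;
            sum += w * dispN;
        }
    }
    float* outRow = (float*)((char*)d_odata + idy * pitch_out);
    outRow[idx] = (w_t != 0.0f) ? sum / w_t : 0.0f;
}
```

With these assumed sizes the two tiles take 24*24*4 + 24*72*2 = 5760 bytes of shared memory per block. The kernel must be launched with a block of exactly BLOCK_W x BLOCK_H threads and a grid that covers the image, e.g. dim3 grid((width + BLOCK_W - 1) / BLOCK_W, (height + BLOCK_H - 1) / BLOCK_H). Whether this helps, and by how much, depends on the hardware and the radius, so it is worth profiling against the original.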

eyal

dorstrem · February 8, 2010, 2:16pm

Thanks for your answer,

But I don’t understand how to use shared memory for d_color. Do I have to define a shared array outside the double for loop, indexed by idx and idy, and/or a shared array inside the double for loop, indexed by idx, idy, i and j?

I know that the biggest problem comes from the uncoalesced memory access, but I don’t understand how to optimize it or how to influence it.

Can you explain?

eyalhir74 · February 8, 2010, 2:28pm

Let’s assume that this index:

KERNEL_RADIUS-i+(KERNEL_W)*(KERNEL_RADIUS-j)

is always below 256, and that you launch 256 threads per block.

Then you could do something like this:

__shared__ float smValues[ 256 ];
int iThreadPosition = threadIdx.x;   // Or any other value.
smValues[ threadIdx.x ] = d_skernel[ iThreadPosition ];
__syncthreads();
// Now you can use the shared memory and not the global memory directly...
for (int i=-KERNEL_RADIUS;i<=KERNEL_RADIUS;i++)
{
   for (int j=-KERNEL_RADIUS;j<=KERNEL_RADIUS;j++)
   {
	  w_spatial= smValues[KERNEL_RADIUS-i+(KERNEL_W)*(KERNEL_RADIUS-j)];
	  ....
   }
}

If the data doesn’t fit in the shared memory array, you can move the smValues load and the __syncthreads() call inside one of the loops.
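The snippet above assumes a 1D block of exactly 256 threads and a table of the same size. A slightly more general sketch, assuming a 2D block and a hypothetical demo kernel named SpatialWeightsDemo, uses a strided loop so the load works for any block size. The values of KERNEL_RADIUS and KERNEL_W are assumptions here, and the output is only a placeholder that sums the spatial weights.

```cuda
#include <cuda_runtime.h>

#define KERNEL_RADIUS 4                      // assumed value, not given in the thread
#define KERNEL_W (2 * KERNEL_RADIUS + 1)     // assumed relation to KERNEL_RADIUS

// Hypothetical kernel: only the staging of the spatial weights is shown.
__global__ void SpatialWeightsDemo(const float* d_skernel, float* d_out,
                                   int width, int height)
{
    __shared__ float smKernel[KERNEL_W * KERNEL_W];

    // Flatten the 2D thread index so every thread of the block helps with the load.
    const int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    const int nThreads = blockDim.x * blockDim.y;

    // Strided loop: works whether the block has more or fewer threads than table entries.
    for (int k = tid; k < KERNEL_W * KERNEL_W; k += nThreads)
        smKernel[k] = d_skernel[k];

    // The barrier comes before any bounds check so that all threads reach it.
    __syncthreads();

    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const int idy = blockIdx.y * blockDim.y + threadIdx.y;
    if (idx >= width || idy >= height) return;

    float w_t = 0.0f;
    for (int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++)
        for (int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; j++)
            w_t += smKernel[KERNEL_RADIUS - i + KERNEL_W * (KERNEL_RADIUS - j)];

    d_out[idy * width + idx] = w_t;   // placeholder: sum of the spatial weights
}
```

Inside the double loop all threads of a warp read the same smKernel entry at the same time, which shared memory serves as a broadcast, so the table costs one global read per entry per block instead of one per thread per iteration.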

Hope this is clearer…

I’m sure the programming guide can explain shared memory better than me :)

eyal