Optimize bilateral filter

Hi all!

I’m using the bilateral filter below in my project, but the function is not optimized at all.

Could someone help me optimize it?

I know shared memory (or other techniques) might help, but I don’t understand how to use it here.



/*  Name: kernelBF.cu                                   */
/*  Last date of modification: 20/04/09                 */

#ifndef _KERNEL_H_
#define _KERNEL_H_

// Used to cooperate with the pitch created with cudaMallocPitch
#define GDATA(base,iDx,iDy,pitch) *((float*)((char*)(base) + (iDy) * (pitch)) + (iDx))

/* Kernel parameters:
 * d_idata   : disparity map before filtering, pitch_in: pitch created with cudaMallocPitch
 * d_odata   : disparity map after filtering, pitch_out: pitch created with cudaMallocPitch
 * d_RGBmap  : color map (R,G,B), 3 values per pixel
 * width     : number of columns, height: number of rows
 * d_skernel : kernel computed on the host side containing the coefficients for spatial filtering
 */
__global__ void
KernelBF(float* d_idata, size_t pitch_in,
         U16* d_RGBmap,  // U16 == (unsigned short int)
         float* d_odata, size_t pitch_out,
         unsigned int width, unsigned int height,
         float* d_skernel)
{
	/********* Variables used in BF computation ********/
	float sig_color = 200.0f; // color distance sigma
	float sig_disp  = 200.0f; // disparity distance sigma
	// the spatial distance sigma is defined in BF.cu

	float d_disp = 0.0f, d_color = 0.0f;
	float w_spatial = 1.0f, w_color = 1.0f, w_disp = 1.0f, w_final = 1.0f;

	// Thread index
	const int tix = threadIdx.x;
	const int tiy = threadIdx.y;
	// Global index
	int idx = blockIdx.x * blockDim.x + tix;
	int idy = blockIdx.y * blockDim.y + tiy;

	if ((idx < width) && (idy < height)) {
		float sum = 0.0f;
		float w_t = 0.0f;
		for (int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; i++) {
			for (int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; j++) {
				if (((idx+i) >= 0) && ((idy+j) >= 0) && ((idx+i) < width) && ((idy+j) < height)) {
					// Sum of absolute R,G,B differences between the center pixel and its
					// neighbour (cast to int first to avoid unsigned wrap-around).
					// The post truncated here; the rest of the loop body is reconstructed
					// assuming Gaussian range weights.
					d_color = (float)( abs((int)d_RGBmap[(idx+width*idy)*3]   - (int)d_RGBmap[(idx+i+width*(idy+j))*3])
					                 + abs((int)d_RGBmap[(idx+width*idy)*3+1] - (int)d_RGBmap[(idx+i+width*(idy+j))*3+1])
					                 + abs((int)d_RGBmap[(idx+width*idy)*3+2] - (int)d_RGBmap[(idx+i+width*(idy+j))*3+2]) );
					d_disp = GDATA(d_idata,idx,idy,pitch_in) - GDATA(d_idata,idx+i,idy+j,pitch_in);

					w_color   = expf(-(d_color*d_color)/(2.0f*sig_color*sig_color));
					w_disp    = expf(-(d_disp*d_disp)/(2.0f*sig_disp*sig_disp));
					w_spatial = d_skernel[KERNEL_RADIUS-i+(KERNEL_W)*(KERNEL_RADIUS-j)];
					w_final   = w_spatial * w_color * w_disp;

					sum += w_final * GDATA(d_idata,idx+i,idy+j,pitch_in);
					w_t += w_final;
				}
			} // for j
		} // for i
		if (w_t != 0.0f) {
			GDATA(d_odata,idx,idy,pitch_out) = sum / w_t;
		} else {
			GDATA(d_odata,idx,idy,pitch_out) = 0.0f;
		}
	} // if
}

#endif // #ifndef _KERNEL_H_

UP !


It seems the first place to start would be to reduce the number of global memory reads you’re doing.

For example, the following code:



   for (int i=-KERNEL_RADIUS;i<=KERNEL_RADIUS;i++){
      for (int j=-KERNEL_RADIUS;j<=KERNEL_RADIUS;j++){
         ...
         w_spatial = d_skernel[KERNEL_RADIUS-i+(KERNEL_W)*(KERNEL_RADIUS-j)];
         ...

It would probably be much better to read the data from d_skernel into shared memory once, and have the threads (and the different iterations of i and j) use the shared copy, rather than reading the same global data over and over again.

In case KERNEL_RADIUS and KERNEL_W are very big, you can split the shared-memory loads into chunks.


Same goes for this:

d_color = (float)( abs(d_RGBmap[(idx+width*idy)*3] - d_RGBmap[(idx+i+width*(idy+j))*3]) + ...



try to put it into shared memory… this access is totally not coalesced…

I guess most of the time is being wasted on uncoalesced and unoptimized memory access patterns…


Thanks for your answer,

But I don’t understand how to use shared memory for d_color. Do I have to define a shared array outside the double for loop, indexed with idx and idy, and/or a shared array inside the double for loop, indexed with idx, idy, i and j?

I know that the biggest problem comes from the uncoalesced memory accesses, but I don’t understand how to optimize them or how to influence the access pattern.

Can you explain ?

Let’s assume that the size of d_skernel (presumably KERNEL_W * KERNEL_W) evaluates to a maximum of 256, and that you launch 256 threads per block.

Then you could do something like this:

__shared__ float smValues[ 256 ];

int iThreadPosition = threadIdx.x;   // Or any other value.

smValues[ threadIdx.x ] = d_skernel[ iThreadPosition ];
__syncthreads();   // make sure the whole block has finished loading


// Now you can use the shared memory and not the global memory directly...

   for (int i=-KERNEL_RADIUS;i<=KERNEL_RADIUS;i++)
      for (int j=-KERNEL_RADIUS;j<=KERNEL_RADIUS;j++)
         w_spatial = smValues[KERNEL_RADIUS-i+(KERNEL_W)*(KERNEL_RADIUS-j)];




If the data doesn’t fit the shared-memory array, you can move the smValues load and the __syncthreads() call inside one of the loops.

hope this is clearer…

I’m sure the programming guide can explain shared memory better than me :)