Image Filter: Bilateral Filter Implementation

Hi everybody,

I’m quite new to CUDA.

I’m trying to learn it by developing simple algorithms (image filters in this case).

My system:

Windows XP, GeForce 8600 GT, Visual Studio 2008

My problem concerns implementing the bilateral filter (like the one in the OpenCV smooth method) on my GPU.

I managed to get it working, but I ran into strange phenomena that look like bugs :-)

The following code is the kernel implementation of the bilateral smooth method.

Please don’t worry about optimizing the computation for now.

#define BLOCK_SIZE_X 32
#define BLOCK_SIZE_Y 16

__global__ void cudaKernel_smoothBilateral( CudaData _input , CudaData _output , int _half_win_size , float _inv_2_sigma_R_squared , float *_domain_weight_profile )
{
    int half_win_size = _half_win_size;

    // Pixel coordinates handled by this thread
    int i = blockIdx.x*blockDim.x+threadIdx.x;
    int j = blockIdx.y*blockDim.y+threadIdx.y;

    int width = _input.width;
    int height = _input.height;
    int k = width*j + i;   // linear index of pixel (i, j)

    float *input_ptr = (float*)_input.cuda_data;
    float *output_ptr = (float*)_output.cuda_data;

    // Interior pixels only: the whole window around (i, j) must lie inside
    // the image, hence the strict upper bounds (with <=, the window would
    // read one column/row past the end of the image).
    if( i >= half_win_size && i < width-half_win_size && j >= half_win_size && j < height-half_win_size )
    {
        float currentPixel = input_ptr[k];
        float finalPixel = currentPixel;
        int inc = 0;       // running index into the domain weight table
        float F = 0.0f;    // weighted sum of neighbor intensities
        float K = 0.0f;    // sum of weights (normalization factor)
        float total_weight;
        float range_weight;

        for(int yy = -half_win_size ; yy <= half_win_size ; ++yy)
        {
            for(int xx = -half_win_size ; xx <= half_win_size ; ++xx )
            {
                float neighborPixel = input_ptr[ width*(j + yy) + (i + xx) ];

                float diff_squared = (neighborPixel - currentPixel);
                diff_squared*=diff_squared;

                // Range (intensity) weight times precomputed domain (spatial) weight
                range_weight = expf(-diff_squared*_inv_2_sigma_R_squared);
                total_weight = range_weight*_domain_weight_profile[ inc ];

                F += total_weight*neighborPixel;
                K += total_weight;

                ++inc;
            }
        }

        if(K != 0.0f)
            finalPixel = F / K;

        output_ptr[ k ] = finalPixel;
    }
    else if( i < width && j < height )
    {
        // Border pixels (inside the image, window does not fit) are set to zero
        output_ptr[ k ] = 0.0f;
    }
}
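The kernel takes `_domain_weight_profile` and `_inv_2_sigma_R_squared` as precomputed inputs, and the host code that builds them is not shown. As a minimal sketch only, assuming a Gaussian spatial kernel, the table could be filled in the same row-major order the kernel walks the window (so that `inc` lines up), and a plain CPU version of the same loop can serve as a reference to compare the GPU output against. The names `makeDomainWeights`, `bilateralReference` and `sigma_d` are hypothetical and do not come from the original code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds the (2*half_win_size+1)^2 spatial weights in
// row-major order (yy outer, xx inner), matching the kernel's `inc` counter.
// Assumes a Gaussian domain kernel with standard deviation sigma_d.
std::vector<float> makeDomainWeights(int half_win_size, float sigma_d)
{
    assert(half_win_size >= 0 && sigma_d > 0.0f);
    const float inv_2_sigma_d_squared = 1.0f / (2.0f * sigma_d * sigma_d);
    std::vector<float> weights;
    for (int yy = -half_win_size; yy <= half_win_size; ++yy)
        for (int xx = -half_win_size; xx <= half_win_size; ++xx)
            weights.push_back(std::exp(-float(xx * xx + yy * yy) * inv_2_sigma_d_squared));
    return weights;
}

// Hypothetical CPU reference of the same filter on a single-channel float
// image. Border pixels whose window does not fit are left at zero.
std::vector<float> bilateralReference(const std::vector<float>& input,
                                      int width, int height, int half_win_size,
                                      float inv_2_sigma_r_squared,
                                      const std::vector<float>& domain_weights)
{
    assert(input.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> output(input.size(), 0.0f);
    for (int j = half_win_size; j < height - half_win_size; ++j) {
        for (int i = half_win_size; i < width - half_win_size; ++i) {
            const float current = input[width * j + i];
            float F = 0.0f;  // weighted sum of neighbor intensities
            float K = 0.0f;  // sum of weights
            int inc = 0;
            for (int yy = -half_win_size; yy <= half_win_size; ++yy) {
                for (int xx = -half_win_size; xx <= half_win_size; ++xx) {
                    const float neighbor = input[width * (j + yy) + (i + xx)];
                    const float diff = neighbor - current;
                    const float w = std::exp(-diff * diff * inv_2_sigma_r_squared)
                                    * domain_weights[inc++];
                    F += w * neighbor;
                    K += w;
                }
            }
            output[width * j + i] = (K != 0.0f) ? F / K : current;
        }
    }
    return output;
}
```

Comparing this reference with the GPU result pixel by pixel (within a small tolerance) would at least show whether a given block size produces wrong values or fails to run at all.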

The images are “interfaced” through this class:

class CudaData
{
public:
    void* cuda_data;          // void* because the data is either float or char
    unsigned int width;
    unsigned int height;
    unsigned int nChannel;
    unsigned long full_size;  // in bytes
    unsigned long size;       // width*height*nChannel

    /* etc ... */
};

The strange phenomena:

  • The kernel crashes (an exception is thrown) with a block size of 32*16, but works with a block size of 16*16 or smaller!

  • With a block size of 32*16, if I replace the line

int half_win_size = _half_win_size;

with

int half_win_size = 2;

the kernel works fine!
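One piece of arithmetic that may be relevant here, offered as a hypothesis and not a confirmed diagnosis: a 32*16 block has 512 threads while a 16*16 block has 256, and all threads of a block must fit in one multiprocessor's register file. The sketch below assumes a register file of 8192 registers per multiprocessor, the figure the CUDA Programming Guide gives for compute capability 1.0/1.1 devices (to be checked for the actual card), and ignores allocation granularity. `maxRegistersPerThread` is a hypothetical helper name.

```cpp
#include <cassert>

// Largest per-thread register count for which a single block of
// block_x * block_y threads still fits in one multiprocessor's register file.
// The register file size is passed in because it depends on the device;
// allocation granularity is ignored, so the real limit may be a bit lower.
int maxRegistersPerThread(int registers_per_multiprocessor, int block_x, int block_y)
{
    assert(registers_per_multiprocessor > 0 && block_x > 0 && block_y > 0);
    return registers_per_multiprocessor / (block_x * block_y);
}
```

Under that assumption, `maxRegistersPerThread(8192, 32, 16)` gives 16 and `maxRegistersPerThread(8192, 16, 16)` gives 32. If the kernel needs more than 16 registers per thread, a 32*16 launch would fail while a 16*16 launch would succeed. A compile-time constant `half_win_size` lets the compiler simplify the loops, which might lower the register count enough to fit. Compiling with `nvcc --ptxas-options=-v` prints the actual register usage, and calling `cudaGetLastError()` right after the launch reports whether the launch itself was rejected.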

I can’t solve this problem by myself. Could you please help me find a way to fix it?

Fab