Hi everybody,
I’m quite new to CUDA.
I’m trying to learn it by implementing simple algorithms (image filters in this case).
My system:
Windows XP, GeForce 8600 GT, Visual Studio 2008
My problem concerns implementing the bilateral filter (like the one the OpenCV smooth method provides) on my GPU.
I managed to get it working, BUT I ran into strange phenomena that look like bugs :-)
The following code is the kernel implementation of the bilateral smooth method.
Please don’t worry about optimizing the computation for now.
#define BLOCK_SIZE_X 32
#define BLOCK_SIZE_Y 16

__global__ void cudaKernel_smoothBilateral( CudaData _input , CudaData _output , int _half_win_size , float _inv_2_sigma_R_squared , float *_domain_weight_profile )
{
    int half_win_size = _half_win_size;
    int i = blockIdx.x*blockDim.x+threadIdx.x;
    int j = blockIdx.y*blockDim.y+threadIdx.y;
    int width = _input.width;
    int height = _input.height;
    int fullsize = _input.size;
    int k = width*j + i;
    float *input_ptr = (float*)_input.cuda_data;
    float *output_ptr = (float*)_output.cuda_data;

    if( i >= half_win_size && i <= width-half_win_size && j >= half_win_size && j <= height-half_win_size )
    {
        float currentPixel = input_ptr[k];
        float finalPixel = currentPixel;
        int inc = 0;
        float F = 0.0;
        float K = 0.0;
        float total_weight;
        float range_weight;

        for(int yy = -half_win_size ; yy <= half_win_size ; ++yy)
        {
            for(int xx = -half_win_size ; xx <= half_win_size ; ++xx )
            {
                float neighborPixel = input_ptr[ width*(j + yy) + (i + xx) ];
                float diff_squared = (neighborPixel - currentPixel);
                diff_squared*=diff_squared;
                range_weight = exp(-diff_squared*_inv_2_sigma_R_squared);
                total_weight = range_weight*_domain_weight_profile[ inc ];
                F += total_weight*neighborPixel;
                K += total_weight;
                ++inc;
            }
        }

        if(K != 0.0)
            finalPixel = F / K;
        output_ptr[ k ] = finalPixel;
    }
    else if( i < width && j < height )
    {
        output_ptr[ k ] = 0.0;
    }
}
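For checking the GPU output, a plain CPU version of the same loop can serve as a reference. The following is only a minimal sketch under a few assumptions: the function name `smoothBilateralReference` and the use of `std::vector` are hypothetical, the image is single-channel row-major float, and the domain weights are stored row by row in the order the kernel walks its window. One deliberate difference: the sketch uses strict `<` on the upper bounds, because with `i <= width-half_win_size` the index `i + xx` can reach `width` (and `j + yy` can reach `height`), which reads outside the image.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the kernel above (name and signature are illustrative).
// input is a row-major, single-channel float image of width*height pixels.
// domain_weights holds (2*half_win_size+1)^2 spatial weights, row by row.
std::vector<float> smoothBilateralReference(const std::vector<float>& input,
                                            int width, int height,
                                            int half_win_size,
                                            float inv_2_sigma_R_squared,
                                            const std::vector<float>& domain_weights)
{
    const int win = 2 * half_win_size + 1;
    assert(input.size() == static_cast<std::size_t>(width) * height);
    assert(domain_weights.size() == static_cast<std::size_t>(win) * win);

    // Pixels too close to the border are left at 0.
    std::vector<float> output(input.size(), 0.0f);

    // Strict upper bounds keep every neighbour read inside the image.
    for (int j = half_win_size; j < height - half_win_size; ++j) {
        for (int i = half_win_size; i < width - half_win_size; ++i) {
            const float currentPixel = input[width * j + i];
            float F = 0.0f;  // weighted sum of neighbour values
            float K = 0.0f;  // sum of weights, used for normalisation
            int inc = 0;
            for (int yy = -half_win_size; yy <= half_win_size; ++yy) {
                for (int xx = -half_win_size; xx <= half_win_size; ++xx) {
                    const float neighborPixel = input[width * (j + yy) + (i + xx)];
                    const float diff = neighborPixel - currentPixel;
                    const float range_weight =
                        std::exp(-diff * diff * inv_2_sigma_R_squared);
                    const float total_weight = range_weight * domain_weights[inc++];
                    F += total_weight * neighborPixel;
                    K += total_weight;
                }
            }
            // Fall back to the unfiltered pixel if all weights vanished.
            output[width * j + i] = (K != 0.0f) ? F / K : currentPixel;
        }
    }
    return output;
}
```

Comparing this result with the image copied back from the GPU, pixel by pixel and with a small tolerance, should show whether a given block size produces wrong values or fails to run at all.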
The images are “interfaced” this way:
class CudaData
{
public:
    void* cuda_data;          // void because either float or char
    unsigned int width;
    unsigned int height;
    unsigned int nChannel;
    unsigned long full_size;  // in BYTE
    unsigned long size;       // width*height*nChannel
    /* etc ... */
};
The strange phenomena:
- This algorithm crashes (an exception is thrown) with a block size of 32*16, but works with a block size of 16*16 or smaller!
- With a block size of 32*16, if I replace the line
int half_win_size = _half_win_size;
with
int half_win_size = 2;
the algorithm works fine!
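One possible explanation worth checking is the register budget per thread. The sketch below is plain host-side arithmetic and rests on an assumption: that the GeForce 8600 GT is a compute capability 1.1 device with 8192 32-bit registers per multiprocessor. Since a whole block has to fit on one multiprocessor, the block size sets a rough ceiling on the registers each thread may use. The helper name `maxRegistersPerThread` is hypothetical, and the estimate ignores allocation granularity.

```cpp
#include <cassert>

// Assumed hardware figure: 8192 32-bit registers per multiprocessor
// (compute capability 1.0/1.1 devices; an assumption for the 8600 GT).
const int kRegistersPerMultiprocessor = 8192;

// Rough ceiling on registers per thread so that one block still fits
// on a single multiprocessor. Ignores register allocation granularity.
int maxRegistersPerThread(int block_size_x, int block_size_y)
{
    const int threads_per_block = block_size_x * block_size_y;
    assert(threads_per_block > 0);
    return kRegistersPerMultiprocessor / threads_per_block;
}
```

Under that assumption, a 32*16 block leaves about 16 registers per thread, while a 16*16 block leaves about 32. If the kernel needs more than 16 registers, the 32*16 launch would fail while 16*16 still runs, and hard-coding `half_win_size = 2` could lower the register count enough to fit, since the compiler can then unroll and simplify the loops. Compiling with `nvcc --ptxas-options=-v` prints the registers used per thread, and calling `cudaGetLastError()` right after the kernel launch would report whether the launch itself was rejected.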
I can’t solve this problem by myself. Could you please help me find a way to fix it?
Fab