Bilateral filter with GPU acceleration

I have implemented a bilateral filter that runs on the CPU and want to port it to the GPU, but I have no idea where to start.

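#include <cmath>
#include <vector>
#include <opencv2/core/core_c.h>  // IplImage, uchar (OpenCV C API)

// Assumes single-channel 8-bit images of identical size.
void bilateralBlur(IplImage* input_img, IplImage* output_img, int sigmaS, int sigmaR)
{
    // Create Gaussian/bilateral filter mask
    // (not used below; the loop recomputes the spatial weight per neighbour)
    int length = 9;
    std::vector<float> mask(length);
    for (int i = 0; i < length; i++)
    {
        // floating-point division; (i*i)/(sigmaS*sigmaS) would truncate as int
        mask[i] = std::exp(-0.5f * (i * i) / float(sigmaS * sigmaS));
        //fprintf( stderr, "i = %d,  value= %f \n", i, mask[i]);
    }

    int m_width  = input_img->width;
    int m_height = input_img->height;
    int src_step = input_img->widthStep;   // bytes per row, may include padding
    int dst_step = output_img->widthStep;

    uchar* src_img = (uchar*) input_img->imageData;
    // The output image is exactly the same size as the input image!
    uchar* dst_img = (uchar*) output_img->imageData;

    // possibly with y=-length/2; y<=m_height-length/2; y+=4
    for (int y = 0; y < m_height; y += 1)
    {
        for (int x = 0; x < m_width; x += 1)
        {
            int centerPix = x + y * src_step;
            double wp = 0.0, k = 0.0;   // reset the accumulators for every pixel

            for (int j = -length/2; j <= length/2; j += 1)
            {
                for (int i = -length/2; i <= length/2; i += 1)
                {
                    // skip neighbours that fall outside the image
                    if (x + i < 0 || x + i >= m_width || y + j < 0 || y + j >= m_height)
                        continue;
                    int curPix = x + i + (y + j) * src_step;

                    // spatial diff
                    double delta = std::sqrt(double(i * i + j * j));
                    double euklidDiff = std::exp(-0.5 * std::pow(delta / sigmaS, 2));

                    // intensity diff relative to the centre pixel
                    double intens = src_img[centerPix] - src_img[curPix];
                    double factor = std::exp(-0.5 * std::pow(intens / sigmaR, 2)) * euklidDiff;

                    wp += factor * src_img[curPix];
                    k  += factor;
                }
            }
            // k >= 1 because the centre pixel always contributes weight 1
            dst_img[x + y * dst_step] = (uchar)(wp / k + 0.5);
        }
    }
}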







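Launch a CUDA kernel on a 2D grid that covers the whole image (width x height threads in total), with a block size of 16x16, for example. Each thread then handles one target pixel, which replaces the (x,y) loop.
Inside the kernel, run the (i,j) neighbourhood loop for that one pixel and write the resulting value to the output image.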
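
As a rough sketch of that mapping, the following plain C++ emulates the launch on the CPU so that it can be compiled without a GPU toolchain. The names bilateralPixel and bilateralBlurGrid are made up for this illustration, and the image is assumed to be a tightly packed single-channel 8-bit buffer. In a real CUDA kernel, bilateralPixel would be the kernel body, x and y would be computed from blockIdx, blockDim and threadIdx, and the four outer loops would disappear because the hardware runs those iterations in parallel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Work of ONE thread: filter the pixel at (x, y).
// In CUDA this would be the body of a __global__ kernel.
void bilateralPixel(const std::vector<std::uint8_t>& src, std::vector<std::uint8_t>& dst,
                    int width, int height, int x, int y,
                    int radius, float sigmaS, float sigmaR)
{
    // Threads of partially filled border blocks fall outside the image.
    if (x >= width || y >= height) return;

    const float center = src[x + y * width];
    float wp = 0.0f, k = 0.0f;

    for (int j = -radius; j <= radius; ++j) {
        for (int i = -radius; i <= radius; ++i) {
            const int nx = x + i, ny = y + j;
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;

            const float value = src[nx + ny * width];
            const float spatial = std::exp(-0.5f * (i * i + j * j) / (sigmaS * sigmaS));
            const float diff = (center - value) / sigmaR;
            const float factor = spatial * std::exp(-0.5f * diff * diff);
            wp += factor * value;
            k += factor;
        }
    }
    // k > 0 because the centre pixel always contributes weight 1.
    dst[x + y * width] = static_cast<std::uint8_t>(wp / k + 0.5f);
}

// CPU stand-in for a kernel launch with a 2D grid of blockSize x blockSize blocks.
// src and dst must be different buffers.
void bilateralBlurGrid(const std::vector<std::uint8_t>& src, std::vector<std::uint8_t>& dst,
                       int width, int height, int radius, float sigmaS, float sigmaR,
                       int blockSize = 16)
{
    assert(src.size() == static_cast<std::size_t>(width) * height);
    dst.resize(src.size());

    // Round up so the grid covers images whose size is not a multiple of blockSize.
    const int gridX = (width + blockSize - 1) / blockSize;
    const int gridY = (height + blockSize - 1) / blockSize;

    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockSize; ++ty)
                for (int tx = 0; tx < blockSize; ++tx) {
                    // Same index arithmetic as blockIdx * blockDim + threadIdx in CUDA.
                    const int x = bx * blockSize + tx;
                    const int y = by * blockSize + ty;
                    bilateralPixel(src, dst, width, height, x, y, radius, sigmaS, sigmaR);
                }
}
```

Rounding the grid size up means some threads land outside the image whenever width or height is not a multiple of the block size, which is why the per-pixel function starts with a bounds check. A call such as bilateralBlurGrid(src, dst, width, height, 4, 3.0f, 30.0f) corresponds to the 9x9 window used in the question.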

You could also use an OpenGL/Direct3D pixel shader for this.


