Sobel filter Kernel-Please comment

Hi all,

I have written a kernel for a Sobel filter which gives me a speed-up of about 85x for a 256-color (8-bit) BMP image of size 3264x2448.

CPU computation takes ~428 ms and the GPU kernel takes approximately 5 ms.

I’ve called the kernel with this execution configuration:

block.x = 16;
block.y = 16;
grid.x = (bmpheader.biWidth % block.x) + bmpheader.biWidth / block.x;
grid.y = (bmpheader.biHeight % block.y) + bmpheader.biHeight / block.y;

Sobel_kernel<<<grid, block>>>(gpu_output_img, bmpheader.biWidth, bmpheader.biHeight);
cudaThreadSynchronize();

Here is the kernel.

__global__ void Sobel_kernel(unsigned char *g_output, int width, int height)
{
    int p00, p01, p02, p10, p12, p20, p21, p22;
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int sumx, sumy, sum;

    if( x < width && y < height )
    {
        if(x == 0 || x == width - 1) sum = 0;
        else if(y == 0 || y == height - 1) sum = 0;
        else
        {
            p00 = tex2D( tex, x - 1, y - 1 );
            p01 = tex2D( tex, x - 1, y );
            p02 = tex2D( tex, x - 1, y + 1 );
            p10 = tex2D( tex, x,     y - 1 );
            //p11 = tex2D( tex, x, y );
            p12 = tex2D( tex, x,     y + 1 );
            p20 = tex2D( tex, x + 1, y - 1 );
            p21 = tex2D( tex, x + 1, y );
            p22 = tex2D( tex, x + 1, y + 1 );

            sumx = p00 - p02 + ((p10 - p12) << 1) + p20 - p22;
            sumy = p00 + ((p01 - p21) << 1) + p02 - p20 - p22;
            sum  = abs(sumx) + abs(sumy);
            if(sum > 255) sum = 255;
        }

        g_output[x + width*y] = 255 - (unsigned char)sum;
    }

    return;
}

Edited: The CPU code used above was unoptimized. Optimized CPU code takes 152 ms for the same image, which reduces the speed-up to about 30x.

Please comment on the drawbacks of this kernel and on how to optimize it for a better speed-up.

Thanks for your time.

I would have each thread compute 4 consecutive pixels (along the X axis) and combine the resulting output into an int variable (after bit shifting). This allows you to perform coalesced writes to the output array; right now the individual byte writes are uncoalesced.

When computing several pixels in a row, you can also reuse prior texture lookups by simply reassigning variables, so each additional pixel costs only 3 extra texture fetches.
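As a host-side illustration of the packing step only (not the kernel itself; the function name and the little-endian byte ordering are my assumptions), the four 8-bit results could be combined like this:

```c
#include <stdint.h>

/* Pack four 8-bit Sobel results into one 32-bit word so that a single
 * int store replaces four separate byte stores. p0 is the leftmost
 * pixel; little-endian byte order is assumed, matching the layout of
 * an unsigned char array reinterpreted as unsigned int. */
static uint32_t pack4(uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3)
{
    return (uint32_t)p0
         | ((uint32_t)p1 << 8)
         | ((uint32_t)p2 << 16)
         | ((uint32_t)p3 << 24);
}
```

In the kernel, the thread would then write that word through a pointer such as ((unsigned int *)g_output), with the x index scaled accordingly.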

Christian

Would you like to benchmark this kernel instead? I’ve tried to get the compiler to produce more optimal PTX by using a few tricks. I have not tested it for correctness of results yet, but it does compile. I am assuming the texture is stored as unsigned chars (not floats).

Update: There is another thread in this forum explaining the impact of using the “volatile” keyword for some strategic variables. ;)

////////////////////////////////////////////////////////////////////////////////
//! Sobel filter kernel
////////////////////////////////////////////////////////////////////////////////

#define define_float4(var,a,b,c,d) float4 var; var.x=a; var.y=b; var.z=c; var.w=d;

__global__ void Sobel_kernel(unsigned char *g_output, int width, int height)
{
    int p00, p01, p02, p10, p12, p20, p21, p22;
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    float fx = (float)x;
    float fy = (float)y;
    int sumx, sumy, sum;

    if( x < width && y < height )
    {
        if(x == 0 || x == width - 1) sum = 0;
        else if(y == 0 || y == height - 1) sum = 0;
        else
        {
            // Perform the computation of texture offsets with a minimal amount
            // of overhead (store intermediate results in registers ASAP).
            // volatile is a trick that helps force values into registers
            // instead of letting the compiler inline the computations later.
            volatile float zero = 0.0f;
            volatile float one  = 1.0f;
            volatile float xm1  = fx - one;
            volatile float xp1  = fx + one;
            volatile float ym1  = fy - one;
            volatile float yp1  = fy + one;

            // I am calling into __utexfetch instead of using the tex2D inline
            // function. Our define_float4() macro may help save some registers,
            // and it allows access to the constant "zero" stored in a register.
            // Unfortunately there are still 8 mov commands between texture
            // fetches, of which 4 are redundant. Let's hope a peephole
            // optimizer in the assembler gets rid of them.
            define_float4(c00, xm1, ym1, zero, zero); p00 = ((uint4)__utexfetch(tex, c00)).x;
            define_float4(c01, xm1, fy,  zero, zero); p01 = ((uint4)__utexfetch(tex, c01)).x;
            define_float4(c02, xm1, yp1, zero, zero); p02 = ((uint4)__utexfetch(tex, c02)).x;
            define_float4(c10, fx,  ym1, zero, zero); p10 = ((uint4)__utexfetch(tex, c10)).x;
            define_float4(c12, fx,  yp1, zero, zero); p12 = ((uint4)__utexfetch(tex, c12)).x;
            define_float4(c20, xp1, ym1, zero, zero); p20 = ((uint4)__utexfetch(tex, c20)).x;
            define_float4(c21, xp1, fy,  zero, zero); p21 = ((uint4)__utexfetch(tex, c21)).x;
            define_float4(c22, xp1, yp1, zero, zero); p22 = ((uint4)__utexfetch(tex, c22)).x;

            sumx = p00 - p02 + ((p10 - p12) << 1) + p20 - p22;
            sumy = p00 + ((p01 - p21) << 1) + p02 - p20 - p22;
            sum  = abs(sumx) + abs(sumy);
            if(sum > 255) sum = 255;
        }

        g_output[x + width*y] = 255 - (unsigned char)sum;
    }

    return;
}

Hi Christian

Your kernel is working perfectly. And yes, the texture is stored as unsigned chars.

I profiled your kernel using QueryPerformanceCounter. It takes 5.2 ms, almost the same as mine, but the number of registers used per thread has come down from 14 to 12.

I’ll try to implement the kernel using the approach you described in your first comment. Thanks for your time; I’ll update once I finish writing the improved version.

Take care with the grid size. I think you want:

grid.x = bmpheader.biWidth / block.x + ((bmpheader.biWidth % block.x) != 0);
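The same ceiling-division idiom written as a plain C helper (the function name is my own; this is just a sketch of the rounding logic):

```c
/* Number of blocks needed to cover `size` elements with blocks of
 * `block` elements, rounding up so a final partial block is included. */
static int grid_dim(int size, int block)
{
    return size / block + (size % block != 0);
}
```

For the 3264x2448 image with 16x16 blocks both dimensions divide evenly (204x153 blocks), so your original (width % block.x) + width / block.x form happens to coincide here; in general it launches up to block.x - 1 extra columns of blocks whenever the remainder exceeds 1.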

I’m also interested in the results of the modified algorithm.

Also, you said it was a color BMP. Are you computing the Y (luminance) value in a different kernel?