Hi all,

I have written a kernel for a Sobel filter that gives me a speed-up of about 85x on a 256-color (8-bit) BMP image of size 3264x2448.

The CPU computation takes ~428 ms and the GPU kernel takes approximately 5 ms.

I launch the kernel with this execution configuration:

```
block.x = 16;
block.y = 16;
grid.x = (bmpheader.biWidth % block.x) + bmpheader.biWidth / block.x;
grid.y = (bmpheader.biHeight % block.y) + bmpheader.biHeight / block.y;
Sobel_kernel<<<grid, block>>>(gpu_output_img, bmpheader.biWidth, bmpheader.biHeight);
cudaThreadSynchronize();
```
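For anyone checking the launch configuration: the grid only has to cover the image, and the usual way to get that is ceiling division. The sketch below is plain host-side C++ that compares the two expressions. `blocks_to_cover` and `blocks_as_posted` are hypothetical helper names introduced only for this illustration.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover `extent` pixels
// with blocks of `block` threads each (ceiling division).
int blocks_to_cover(int extent, int block)
{
    assert(extent >= 0 && block > 0);
    return (extent + block - 1) / block;
}

// The remainder-plus-quotient expression from the launch code above,
// kept only for comparison.
int blocks_as_posted(int extent, int block)
{
    assert(extent >= 0 && block > 0);
    return (extent % block) + extent / block;
}
```

For this image both forms agree (3264 / 16 = 204 and 2448 / 16 = 153 blocks), because both dimensions are multiples of 16. For a width of 100 the posted form gives 10 blocks where 7 are enough. The extra threads are rejected by the bounds check in the kernel, so the result stays correct, but they still get launched.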

Here is the kernel.

```
__global__ void Sobel_kernel(unsigned char *g_output, int width, int height)
{
    int p00, p01, p02, p10, p12, p20, p21, p22;
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int sumx, sumy, sum;

    if (x < width && y < height)
    {
        if (x == 0 || x == width - 1) sum = 0;
        else if (y == 0 || y == height - 1) sum = 0;
        else
        {
            p00 = tex2D(tex, x - 1, y - 1);
            p01 = tex2D(tex, x - 1, y);
            p02 = tex2D(tex, x - 1, y + 1);
            p10 = tex2D(tex, x, y - 1);
            //p11 = tex2D(tex, x, y);
            p12 = tex2D(tex, x, y + 1);
            p20 = tex2D(tex, x + 1, y - 1);
            p21 = tex2D(tex, x + 1, y);
            p22 = tex2D(tex, x + 1, y + 1);

            sumx = p00 - p02 + ((p10 - p12) << 1) + p20 - p22;
            sumy = (p00) + ((p01 - p21) << 1) + (p02) - p20 - (p22);
            sum = abs(sumx) + abs(sumy);
            if (sum > 255) sum = 255;
        }
        g_output[x + width * y] = 255 - (unsigned char)sum;
    }
    return;
}
```
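To check the kernel's output, a CPU reference that mirrors the same arithmetic can be useful. The following is only a sketch: `sobel_reference` is a hypothetical name, it is not the CPU code that was timed in this post, and it assumes the input is a row-major 8-bit buffer with no row padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical CPU reference mirroring the kernel above.
// Assumption: pixel (x, y) is stored at in[x + width * y], no row padding.
std::vector<unsigned char> sobel_reference(const std::vector<unsigned char>& in,
                                           int width, int height)
{
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<unsigned char> out(in.size());

    // Same naming as the kernel: first digit is the x offset, second the y offset.
    auto px = [&](int x, int y) { return static_cast<int>(in[x + width * y]); };

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;  // border pixels keep sum = 0, as in the kernel
            if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
                int p00 = px(x - 1, y - 1), p01 = px(x - 1, y), p02 = px(x - 1, y + 1);
                int p10 = px(x, y - 1), p12 = px(x, y + 1);
                int p20 = px(x + 1, y - 1), p21 = px(x + 1, y), p22 = px(x + 1, y + 1);

                // 2 * (...) applies the same weight as the kernel's "<< 1".
                int sumx = p00 - p02 + 2 * (p10 - p12) + p20 - p22;
                int sumy = p00 + 2 * (p01 - p21) + p02 - p20 - p22;
                sum = std::abs(sumx) + std::abs(sumy);
                if (sum > 255) sum = 255;
            }
            // Inverted output, as in the kernel: strong edges become dark.
            out[x + width * y] = static_cast<unsigned char>(255 - sum);
        }
    }
    return out;
}
```

Comparing this buffer byte by byte with the copy of `gpu_output_img` read back from the device should show whether the GPU result matches, assuming the texture is bound to the same pixel data with unnormalized coordinates and point sampling.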

Edit: The CPU code used for the timing above was unoptimized. An optimized CPU version takes 152 ms for the same image, which reduces the speed-up to about 30x.
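The speed-up figures are simply ratios of the measured times. Here is a small sketch of that arithmetic, plus a rough pixel throughput. `speedup` and `megapixels_per_second` are hypothetical helper names, and since the 5 ms GPU time is only approximate, the results are rough as well.

```cpp
#include <cassert>

// Ratio of CPU time to GPU time, both in milliseconds.
double speedup(double cpu_ms, double gpu_ms)
{
    assert(gpu_ms > 0.0);
    return cpu_ms / gpu_ms;
}

// Throughput in megapixels per second for an image processed in elapsed_ms.
double megapixels_per_second(int width, int height, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    double pixels = static_cast<double>(width) * height;
    // pixels per millisecond divided by 1000 gives megapixels per second
    return pixels / (elapsed_ms * 1000.0);
}
```

With the numbers from this post, `speedup(428, 5)` is about 85.6 and `speedup(152, 5)` is about 30.4. The image has 3264 x 2448 = 7,990,272 pixels, so `megapixels_per_second(3264, 2448, 5.0)` comes out at roughly 1598 Mpixel/s for the kernel.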

Please comment on the drawbacks of this kernel and on how to optimize it for a better speed-up.

Thanks for your time.