# Image processing

Hello everybody, i have to develop a program wich process a greyscale image and i am far from being a killer in cuda

I’ve got a problem, let me explain :

For example an image with resolution 10x10 is just a one-dimensional array like that :

For my program, i have to process the green cells and i don’t touch the red cells.

This is how i do in C++ :

``````for(int i=1;i<(height-1);i++)

{

for(int j=1;j<(width-1);j++)

{

Gx1 = Gy1 = datanvg[((i-1)*width)+(j-1)];

Gx2 = Gy2 = datanvg[((i-1)*width)+j];

Gx3 = Gy3 = datanvg[((i-1)*width)+(j+1)];            // In this part i use the openCV library :

Gx4 = Gy4 = datanvg[(i*width)+(j-1)];                // I get the value of pixels located around the pixel that I process

Gx5 = Gy5 = datanvg[(i*width)+j];                    // and then I perform the process.

Gx6 = Gy6 = datanvg[(i*width)+(j+1)];

Gx7 = Gy7 = datanvg[((i+1)*width)+(j-1)];

Gx8 = Gy8 = datanvg[((i+1)*width)+j];

Gx9 = Gy9 = datanvg[((i+1)*width)+(j+1)];

Gx = (Gx1*matGx1)+(Gx2*matGx2)+(Gx3*matGx3)+(Gx4*matGx4)+(Gx5*matGx5)+(Gx6*matGx6)+(Gx7*matGx7)+(Gx8*matGx8)+(Gx9*matGx9); // This is the process

Gy = (Gy1*matGy1)+(Gy2*matGy2)+(Gy3*matGy3)+(Gy4*matGy4)+(Gy5*matGy5)+(Gy6*matGy6)+(Gy7*matGy7)+(Gy8*matGy8)+(Gy9*matGy9);

datasob[(i*width)+j] = sqrt(powf(Gx,2)+powf(Gy,2)); // I put the new value into a new image

}

}
``````

For more understanding here is what I do for the pixel located in the cell 55:

(matGx and matGy are int that i declared above)

Gx = ((value of cell 44) * matGx1) + ((value of cell 45) * matGx2) + ((value of cell 46) * matGx3) + ((value of cell 54) * matGx4) + ((value of cell 55) * matGx5) + … + ((value of cell 66) * matGx9);

Gx = ((value of cell 44) * matGy1) + ((value of cell 45) * matGy2) + ((value of cell 46) * matGy3) + ((value of cell 54) * matGy4) + ((value of cell 55) * matGy5) + … + ((value of cell 66) * matGy9);

Final value of the pixel 55 = sqrt(powf(Gx,2)+powf(Gy,2));

All of this is C++ and i would like to make it in a cuda kernel (send the greyscale array and receive the processed array).

In fact, i already try to make the double for-loop in the cuda kernel … i dont know how but i just broke my 8800 gtx so i’m a little bit scared about cuda.

can you give me some ideas to make this kernel?

Thank you and i hope i will be understood, if you want a clarification, ask me!

I want to do something like that :

``````__global__ void Sobel(int* Gris, int* Sobel, int height, int width)

{

int i = blockDim.x * blockIdx.x + threadIdx.x;

int j = blockDim.y * blockIdx.y + threadIdx.y;

for(i=1;i<(height-1);i++)

{

for(j=1;j<(width-1);j++)

{

Gx1 = Gy1 = Gris[((i-1)*width)+(j-1)];

Gx2 = Gy2 = Gris[((i-1)*width)+j];

Gx3 = Gy3 = Gris[((i-1)*width)+(j+1)];

Gx4 = Gy4 = Gris[(i*width)+(j-1)];

Gx5 = Gy5 = Gris[(i*width)+j];

Gx6 = Gy6 = Gris[(i*width)+(j+1)];

Gx7 = Gy7 = Gris[((i+1)*width)+(j-1)];

Gx8 = Gy8 = Gris[((i+1)*width)+j];

Gx9 = Gy9 = Gris[((i+1)*width)+(j+1)];

Gx = (Gx1*1)+(Gx2*0)+(Gx3*(-1))+(Gx4*2)+(Gx5*0)+(Gx6*(-2))+(Gx7*1)+(Gx8*0)+(Gx9*(-1));  // This is the value of my matGx and matGy above

Gy = (Gy1*1)+(Gy2*2)+(Gy3*1)+(Gy4*0)+(Gy5*0)+(Gy6*0)+(Gy7*(-1))+(Gy8*(-2))+(Gy9*(-1));  // I put all values to be well understood

Sobel[(i*width)+j] = sqrt(powf(Gx,2)+powf(Gy,2));   //I'm not shure, i can do this like that

}

}

}
``````

Is it a good way to do ?

A much better way that actually uses the parallelism that gives CUDA its power would be

``````__global__ void Sobel(int* Gris, int* Sobel, int height, int width)

{

int i = blockDim.x * blockIdx.x + threadIdx.x + 1;

int j = blockDim.y * blockIdx.y + threadIdx.y + 1;

if (i < (height-1))

{

if (j < (width-1))

{

Gx1 = Gy1 = Gris[((i-1)*width)+(j-1)];

Gx2 = Gy2 = Gris[((i-1)*width)+j];

Gx3 = Gy3 = Gris[((i-1)*width)+(j+1)];

Gx4 = Gy4 = Gris[(i*width)+(j-1)];

Gx5 = Gy5 = Gris[(i*width)+j];

Gx6 = Gy6 = Gris[(i*width)+(j+1)];

Gx7 = Gy7 = Gris[((i+1)*width)+(j-1)];

Gx8 = Gy8 = Gris[((i+1)*width)+j];

Gx9 = Gy9 = Gris[((i+1)*width)+(j+1)];

Gx = (Gx1*1)+(Gx2*0)+(Gx3*(-1))+(Gx4*2)+(Gx5*0)+(Gx6*(-2))+(Gx7*1)+(Gx8*0)+(Gx9*(-1));  // This is the value of my matGx and matGy above

Gy = (Gy1*1)+(Gy2*2)+(Gy3*1)+(Gy4*0)+(Gy5*0)+(Gy6*0)+(Gy7*(-1))+(Gy8*(-2))+(Gy9*(-1));  // I put all values to be well understood

Sobel[(i*width)+j] = sqrt(powf(Gx,2)+powf(Gy,2));   //I'm not shure, i can do this like that

}

}

}
``````

The previous code did the same operations with the same data in all threads.

It looks like you want to do an ordinary 2D convolution, look at the convolution examples in the CUDA SDK.

I’m sorry i was very busy during last few days. I tried the kernel you showed to me but it doesn’t work, i get a black picture instead of a sobel filtered picture.

This is the key parts of my code :

both kernels : (first to convert the picture on greyscale and second to convert greyscale to sobel filtered)

``````__global__ void Nvg(int* Red, int* Blue, int* Green, int* Grey, int N)

{

int i = blockDim.x * blockIdx.x + threadIdx.x;

if (i < N)

{

Grey[i] = ((Red[i] * 0.299) + (Blue[i] * 0.114) + (Green[i] * 0.587))/3;

}

}

__global__ void Sobel(int* Grey, int* Sobel, int height, int width)

{

int i = blockDim.x * blockIdx.x + threadIdx.x + 1;

int j = blockDim.y * blockIdx.y + threadIdx.y + 1;

int Gx = 0;

int Gy = 0;

if(i < (height-1))

{

if(j < (width-1))

{

Gx = (Grey[((i-1)*width)+(j-1)]*1)

+(Grey[((i-1)*width)+(j+1)]*(-1))

+(Grey[(i*width)+(j-1)]*2)

+(Grey[(i*width)+(j+1)]*(-2))

+(Grey[((i+1)*width)+(j-1)]*1)

+(Grey[((i+1)*width)+(j+1)]*(-1));

Gy = (Grey[((i-1)*width)+(j-1)]*1)

+(Grey[((i-1)*width)+j]*2)

+(Grey[((i-1)*width)+(j+1)]*1)

+(Grey[((i+1)*width)+(j-1)]*(-1))

+(Grey[((i+1)*width)+j]*(-2))

+(Grey[((i+1)*width)+(j+1)]*(-1));

Sobel[(i*width)+j] = sqrt(powf(Gx,2)+powf(Gy,2));

}

}

}
``````

Allocation and copying data :

``````cutilSafeCall( cudaMalloc((void**)&d_Red, size) );

cutilSafeCall( cudaMalloc((void**)&d_Blue, size) );

cutilSafeCall( cudaMalloc((void**)&d_Green, size) );

cutilSafeCall( cudaMalloc((void**)&d_Grey, size) );

cutilSafeCall( cudaMalloc((void**)&d_Sobel, size) );

cutilSafeCall( cudaMemcpy(d_Red, h_Red, size, cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMemcpy(d_Blue, h_Blue, size, cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMemcpy(d_Green, h_Green, size, cudaMemcpyHostToDevice) );
``````

Call both kernels and copying results:

``````int threadsPerBlock = 256;

Nvg<<<blocksPerGrid, threadsPerBlock>>>(d_Red, d_Blue, d_Green, d_Grey, N);

cutilSafeCall( cudaMemcpy(h_Grey, d_Grey, size, cudaMemcpyDeviceToHost) );

cutilSafeCall( cudaMemcpy(h_Sobel, d_Sobel, size, cudaMemcpyDeviceToHost) );
``````

Then i fill both pictures (Greyscale and Sobel with values from h_Grey and h_Sobel).

Greyscale works fine but Sobel is all black (all values of h_Sobel = 0). I don’t understand why…

my idea is a little different from all of yours, before the kernel starts i bind the input global memory to a texture memory.the texture memory is referenced in kernel.

my idea is a little different from all of yours, before the kernel starts i bind the input global memory to a texture memory.the texture memory is referenced in kernel.

if( idx < width && idy <height)
{

``````	 float  ul;
float  um;
float  ur;
float  ml;
float  mm;
float  mr;
float  ll;
float  lm;
float  lr;
float Horz;
float Vert;
float Ix2;
float Ixy;
float Iy2;

ul = tex2D( Coffe, (float)(idx-1), (float)(idy-1) );
um = tex2D( Coffe, (float)(idx+0), (float)(idy-1) );
ur = tex2D( Coffe, (float)(idx+1), (float)(idy-1) );
ml = tex2D( Coffe, (float)(idx-1), (float)(idy+0) );
mm = tex2D( Coffe, (float)(idx+0), (float)(idy+0) );
mr = tex2D( Coffe, (float)(idx+1), (float)(idy+0) );
ll = tex2D( Coffe, (float)(idx-1), (float)(idy+1) );
lm = tex2D( Coffe, (float)(idx+0), (float)(idy+1) );
lr = tex2D( Coffe, (float)(idx+1), (float)(idy+1) );
........``````

1.- Take care of all the boundary of the image :

``````if ( ( i > 0 && i < width - 1) && ( j > 0 && j < height - 1) )

{

....

}
``````

2.- Run the kernel in a two dimension grid:

``````#define DIM_BLOCK 16

// kernel configuration

dim3 dimBlock(DIM_BLOCK, DIM_BLOCK, 1);

dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
``````

I tested your kernel and it works. It’s slow but you will have time for optimization, so dont worry for this right now :)

EDIT: I didnt notice you add +1 to i and j when mapping from threadIdx/BlockIdx to pixel position. Did you consider that when compute the Sobel? You have +1 offset respect to the current pixel. I would remove +1 from the i, j calculation and check it in the if condition i commented above.

Hope this help.

OMG It works MOUHAHAHAHA (Evil Laught)!

Can i just annoy you one more time ?

What’s the difference between this two kernel calls?

``````threadsPerBlock = 256;

``````
``````dim3 dimBlock(DIM_BLOCK, DIM_BLOCK, 1);

dim3 dimGrid(larg / dimBlock.x, haut / dimBlock.y, 1);

Sobel<<<dimGrid, dimBlock>>>(d_Gris, d_Sobel, haut, larg);
``````

In the this kernel you have a 1D grid of blocks and threads and you will make reference to the threads in the grid using threadIdx.x, blockIdx.x and so on.

And in this kernel you have a 2D grid, you will make reference to the threads as you would in an matrix (row, col) so here you use threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y, blockDim.x and so on.

Hope that helps :)

P.S. I’m making some stuff of image processing so if you want I can share with you my sobel code and maybe you can adapt it to your needs !