Image processing

Arnibu · April 10, 2011, 8:19am

Hello everybody, i have to develop a program wich process a greyscale image and i am far from being a killer in cuda External Image

I’ve got a problem, let me explain :

For example an image with resolution 10x10 is just a one-dimensional array like that :

External Media

For my program, i have to process the green cells and i don’t touch the red cells.

This is how i do in C++ :

for(int i=1;i<(height-1);i++) 

{

	for(int j=1;j<(width-1);j++)

	{

		Gx1 = Gy1 = datanvg[((i-1)*width)+(j-1)];

		Gx2 = Gy2 = datanvg[((i-1)*width)+j];

		Gx3 = Gy3 = datanvg[((i-1)*width)+(j+1)];            // In this part i use the openCV library : 

		Gx4 = Gy4 = datanvg[(i*width)+(j-1)];                // I get the value of pixels located around the pixel that I process 

		Gx5 = Gy5 = datanvg[(i*width)+j];                    // and then I perform the process.

		Gx6 = Gy6 = datanvg[(i*width)+(j+1)];

		Gx7 = Gy7 = datanvg[((i+1)*width)+(j-1)];

		Gx8 = Gy8 = datanvg[((i+1)*width)+j];

		Gx9 = Gy9 = datanvg[((i+1)*width)+(j+1)];

	

		Gx = (Gx1*matGx1)+(Gx2*matGx2)+(Gx3*matGx3)+(Gx4*matGx4)+(Gx5*matGx5)+(Gx6*matGx6)+(Gx7*matGx7)+(Gx8*matGx8)+(Gx9*matGx9); // This is the process

		Gy = (Gy1*matGy1)+(Gy2*matGy2)+(Gy3*matGy3)+(Gy4*matGy4)+(Gy5*matGy5)+(Gy6*matGy6)+(Gy7*matGy7)+(Gy8*matGy8)+(Gy9*matGy9);

		datasob[(i*width)+j] = sqrt(powf(Gx,2)+powf(Gy,2)); // I put the new value into a new image 

	}

}

For more understanding here is what I do for the pixel located in the cell 55:

(matGx and matGy are int that i declared above)

Gx = ((value of cell 44) * matGx1) + ((value of cell 45) * matGx2) + ((value of cell 46) * matGx3) + ((value of cell 54) * matGx4) + ((value of cell 55) * matGx5) + … + ((value of cell 66) * matGx9);

Gx = ((value of cell 44) * matGy1) + ((value of cell 45) * matGy2) + ((value of cell 46) * matGy3) + ((value of cell 54) * matGy4) + ((value of cell 55) * matGy5) + … + ((value of cell 66) * matGy9);

Final value of the pixel 55 = sqrt(powf(Gx,2)+powf(Gy,2));

All of this is C++ and i would like to make it in a cuda kernel (send the greyscale array and receive the processed array).

In fact, i already try to make the double for-loop in the cuda kernel … i dont know how but i just broke my 8800 gtx so i’m a little bit scared about cuda.

can you give me some ideas to make this kernel?

Thank you and i hope i will be understood, if you want a clarification, ask me!

Arnibu · April 10, 2011, 9:12am

I want to do something like that :

__global__ void Sobel(int* Gris, int* Sobel, int height, int width)

{

        int i = blockDim.x * blockIdx.x + threadIdx.x;

        int j = blockDim.y * blockIdx.y + threadIdx.y;

for(i=1;i<(height-1);i++) 

	{

		for(j=1;j<(width-1);j++)

		{

			Gx1 = Gy1 = Gris[((i-1)*width)+(j-1)];

			Gx2 = Gy2 = Gris[((i-1)*width)+j];

			Gx3 = Gy3 = Gris[((i-1)*width)+(j+1)];

			Gx4 = Gy4 = Gris[(i*width)+(j-1)];      

			Gx5 = Gy5 = Gris[(i*width)+j];

			Gx6 = Gy6 = Gris[(i*width)+(j+1)];

			Gx7 = Gy7 = Gris[((i+1)*width)+(j-1)];

			Gx8 = Gy8 = Gris[((i+1)*width)+j];

			Gx9 = Gy9 = Gris[((i+1)*width)+(j+1)];

			Gx = (Gx1*1)+(Gx2*0)+(Gx3*(-1))+(Gx4*2)+(Gx5*0)+(Gx6*(-2))+(Gx7*1)+(Gx8*0)+(Gx9*(-1));  // This is the value of my matGx and matGy above

			Gy = (Gy1*1)+(Gy2*2)+(Gy3*1)+(Gy4*0)+(Gy5*0)+(Gy6*0)+(Gy7*(-1))+(Gy8*(-2))+(Gy9*(-1));  // I put all values to be well understood

			Sobel[(i*width)+j] = sqrt(powf(Gx,2)+powf(Gy,2));   //I'm not shure, i can do this like that

		}

	}

}

Is it a good way to do ?

tera · April 10, 2011, 12:03pm

A much better way that actually uses the parallelism that gives CUDA its power would be

__global__ void Sobel(int* Gris, int* Sobel, int height, int width)

{

        int i = blockDim.x * blockIdx.x + threadIdx.x + 1;

        int j = blockDim.y * blockIdx.y + threadIdx.y + 1;

if (i < (height-1)) 

	{

		if (j < (width-1))

		{

			Gx1 = Gy1 = Gris[((i-1)*width)+(j-1)];

			Gx2 = Gy2 = Gris[((i-1)*width)+j];

			Gx3 = Gy3 = Gris[((i-1)*width)+(j+1)];

			Gx4 = Gy4 = Gris[(i*width)+(j-1)];      

			Gx5 = Gy5 = Gris[(i*width)+j];

			Gx6 = Gy6 = Gris[(i*width)+(j+1)];

			Gx7 = Gy7 = Gris[((i+1)*width)+(j-1)];

			Gx8 = Gy8 = Gris[((i+1)*width)+j];

			Gx9 = Gy9 = Gris[((i+1)*width)+(j+1)];

			Gx = (Gx1*1)+(Gx2*0)+(Gx3*(-1))+(Gx4*2)+(Gx5*0)+(Gx6*(-2))+(Gx7*1)+(Gx8*0)+(Gx9*(-1));  // This is the value of my matGx and matGy above

			Gy = (Gy1*1)+(Gy2*2)+(Gy3*1)+(Gy4*0)+(Gy5*0)+(Gy6*0)+(Gy7*(-1))+(Gy8*(-2))+(Gy9*(-1));  // I put all values to be well understood

			Sobel[(i*width)+j] = sqrt(powf(Gx,2)+powf(Gy,2));   //I'm not shure, i can do this like that

		}

	}

}

The previous code did the same operations with the same data in all threads.

wanderine · April 11, 2011, 8:46am

It looks like you want to do an ordinary 2D convolution, look at the convolution examples in the CUDA SDK.

Arnibu · April 12, 2011, 9:16am

Thank you for your answers, I’ll check it tonight! External Image

Arnibu · April 25, 2011, 12:54pm

I’m sorry i was very busy during last few days. I tried the kernel you showed to me but it doesn’t work, i get a black picture instead of a sobel filtered picture.

This is the key parts of my code :

both kernels : (first to convert the picture on greyscale and second to convert greyscale to sobel filtered)

__global__ void Nvg(int* Red, int* Blue, int* Green, int* Grey, int N)

{

    int i = blockDim.x * blockIdx.x + threadIdx.x;

if (i < N)

    {

        Grey[i] = ((Red[i] * 0.299) + (Blue[i] * 0.114) + (Green[i] * 0.587))/3;

    }

}

__global__ void Sobel(int* Grey, int* Sobel, int height, int width)

{

    int i = blockDim.x * blockIdx.x + threadIdx.x + 1;

    int j = blockDim.y * blockIdx.y + threadIdx.y + 1;

int Gx = 0;

    int Gy = 0;

if(i < (height-1))

    {

	if(j < (width-1))

	{

            Gx = (Grey[((i-1)*width)+(j-1)]*1)

                +(Grey[((i-1)*width)+(j+1)]*(-1))

                +(Grey[(i*width)+(j-1)]*2)

                +(Grey[(i*width)+(j+1)]*(-2))

                +(Grey[((i+1)*width)+(j-1)]*1)

                +(Grey[((i+1)*width)+(j+1)]*(-1));

Gy = (Grey[((i-1)*width)+(j-1)]*1)

                +(Grey[((i-1)*width)+j]*2)

                +(Grey[((i-1)*width)+(j+1)]*1)

                +(Grey[((i+1)*width)+(j-1)]*(-1))

                +(Grey[((i+1)*width)+j]*(-2))

                +(Grey[((i+1)*width)+(j+1)]*(-1));

Sobel[(i*width)+j] = sqrt(powf(Gx,2)+powf(Gy,2));

	}

    }

}

Allocation and copying data :

cutilSafeCall( cudaMalloc((void**)&d_Red, size) );

cutilSafeCall( cudaMalloc((void**)&d_Blue, size) );

cutilSafeCall( cudaMalloc((void**)&d_Green, size) );

cutilSafeCall( cudaMalloc((void**)&d_Grey, size) );

cutilSafeCall( cudaMalloc((void**)&d_Sobel, size) );

cutilSafeCall( cudaMemcpy(d_Red, h_Red, size, cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMemcpy(d_Blue, h_Blue, size, cudaMemcpyHostToDevice) );

cutilSafeCall( cudaMemcpy(d_Green, h_Green, size, cudaMemcpyHostToDevice) );

Call both kernels and copying results:

int threadsPerBlock = 256;

int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

Nvg<<<blocksPerGrid, threadsPerBlock>>>(d_Red, d_Blue, d_Green, d_Grey, N);

threadsPerBlock = 256;

blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

Sobel<<<blocksPerGrid, threadsPerBlock>>>(d_Grey, d_Sobel, height, width);

cutilSafeCall( cudaMemcpy(h_Grey, d_Grey, size, cudaMemcpyDeviceToHost) );

cutilSafeCall( cudaMemcpy(h_Sobel, d_Sobel, size, cudaMemcpyDeviceToHost) );

Then i fill both pictures (Greyscale and Sobel with values from h_Grey and h_Sobel).

Greyscale works fine but Sobel is all black (all values of h_Sobel = 0). I don’t understand why…

External Image

skywwz · April 27, 2011, 12:56pm

my idea is a little different from all of yours, before the kernel starts i bind the input global memory to a texture memory.the texture memory is referenced in kernel.

skywwz · April 27, 2011, 12:58pm

my idea is a little different from all of yours, before the kernel starts i bind the input global memory to a texture memory.the texture memory is referenced in kernel.

if( idx < width && idy <height)
{

	 float  ul;
	 float  um;
	 float  ur;
	 float  ml;
	 float  mm;
	 float  mr;
	 float  ll;
	 float  lm;
	 float  lr;
	 float Horz;
	 float Vert;
	 float Ix2;
	 float Ixy;
	 float Iy2;


	 ul = tex2D( Coffe, (float)(idx-1), (float)(idy-1) );
	 um = tex2D( Coffe, (float)(idx+0), (float)(idy-1) );
	 ur = tex2D( Coffe, (float)(idx+1), (float)(idy-1) );
	 ml = tex2D( Coffe, (float)(idx-1), (float)(idy+0) );
	 mm = tex2D( Coffe, (float)(idx+0), (float)(idy+0) );
	 mr = tex2D( Coffe, (float)(idx+1), (float)(idy+0) );
	 ll = tex2D( Coffe, (float)(idx-1), (float)(idy+1) );
	 lm = tex2D( Coffe, (float)(idx+0), (float)(idy+1) );
	 lr = tex2D( Coffe, (float)(idx+1), (float)(idy+1) );
          ........

pQB · April 28, 2011, 8:03am

Hello Arnibu, i would like add two comments.

1.- Take care of all the boundary of the image :

if ( ( i > 0 && i < width - 1) && ( j > 0 && j < height - 1) )

	{

	....

	}

2.- Run the kernel in a two dimension grid:

#define DIM_BLOCK 16

	// kernel configuration

	dim3 dimBlock(DIM_BLOCK, DIM_BLOCK, 1);

	dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

I tested your kernel and it works. It’s slow but you will have time for optimization, so dont worry for this right now :)

EDIT: I didnt notice you add +1 to i and j when mapping from threadIdx/BlockIdx to pixel position. Did you consider that when compute the Sobel? You have +1 offset respect to the current pixel. I would remove +1 from the i, j calculation and check it in the if condition i commented above.

Hope this help.

Arnibu · May 2, 2011, 6:00pm

Thank you all for all your answers, i try it NOW! External Image

OMG It works MOUHAHAHAHA (Evil Laught)!

Can i just annoy you one more time ? External Image

What’s the difference between this two kernel calls?

threadsPerBlock = 256;

blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

Sobel<<<blocksPerGrid, threadsPerBlock>>>(d_Gris, d_Sobel, haut, larg);

dim3 dimBlock(DIM_BLOCK, DIM_BLOCK, 1);

dim3 dimGrid(larg / dimBlock.x, haut / dimBlock.y, 1);

Sobel<<<dimGrid, dimBlock>>>(d_Gris, d_Sobel, haut, larg);

BRabbit27 · May 2, 2011, 6:27pm

In the this kernel you have a 1D grid of blocks and threads and you will make reference to the threads in the grid using threadIdx.x, blockIdx.x and so on.

And in this kernel you have a 2D grid, you will make reference to the threads as you would in an matrix (row, col) so here you use threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y, blockDim.x and so on.

Hope that helps :)

P.S. I’m making some stuff of image processing so if you want I can share with you my sobel code and maybe you can adapt it to your needs !