CUDA and Image Processing

Hello,

I want to use CUDA to implement image processing treatment.

So I load an image with cutLoadPNG() in an unsigned char*

I would like one thread per pixel and I would like to access to the pixel position.

I have tried to use the Sobel Filter example but I have not succeed.

unsigned char *pSobel = 

      (unsigned char *) (((char *) pSobelOriginal)+blockIdx.x*Pitch);

for ( int i = threadIdx.x; i < w; i += blockDim.x ) {...}

I thought that threadIdx.x match to the x coordinate and blockIdx.x match to the y coordinate, but No External Image

External Image Help

Try to imagine this way, your image is “covered” with squared regions. And each square is a single blocks and all pixels in that square will be processed from thread from that block.For convenience imagine picture is in forth quadrant meaning pixel with coordinate (0,0) is at the topmost left corner on the screen.

Now, number of blocks is defined with dimensions of the grid. For example if you define Grid as (16,9,1) it means you will have 16x9=141 blocks in it and try to imagine those blocks are rectangles region pattern on your image. Now, each block has a bunch of threads defined by dimensions of the block. So, if you define block like (32,32,1) you are defined max number of threads (in this case 32x32=1024 threads) per block. Now threadIdx.x and threadIdx.y are pixel local coordinates in that block. Each block (rectangle on image) has its own position on image counted in pixels. Don’t confuse this with block coordinates in the grid.

To calculate block position on image (in pixels) you need to multiply block dimension with block size so

BlockPosInPixels.x = BlockDim.x * BlockIdx.x

BlockPosInPixels.y = BlockDim.y * BlockIdx.y

Think on this coordinate as coordinate of topmost left pixel in each rectangle .

Now you only need to add local pixel position and thats it.

PixelPos.x = BlockDim.x * BlockIdx.x + ThreadIdx.x

PixelPos.y = BlockDim.y * BlockIdx.y + ThreadIdx.y

That way is easier for understanding but could lead to inefficient block execution (squares on rightmost side and bottommost side of image) if image resolution is not perfectly match multiplied block size or with other word

if (width % BlockDim.x != 0) || (height % BlockDim.y) != 0 You have those border blocks, actually their multiprocessors are not utilized 100% .

One slightly complicated way would be to imagine picture as 1D array That way you don’t have problems like that (actually you could have only last block not utilized 100% if block size is not perfect divisor of 1D array) but relations between coordinates and indexes are complicated.

All this assume that you did convert your png (compressed) image to standard bitmap representation.

sorry for my bad english

thanks
External Image

In fact I must have a thread for a pixel.

So if I use 10241024 images with a blocksize 3232 , the grid size must be also 32*32, if I understood.

But the loop is it adapted to this action???

When you have a thread for each pixel, why would you loop?

That is my Kernel, for create that I have used the Sobel Filter example in the CUDA SDK 1.1 where a loop is used

//------------------------------------------------------------------------------

//Hough Transform Method

//	@param imgOut

//	@param imgIn : pointer inthe image in the devide

//	@param nr:number of row

//	@param nc:number of column

//------------------------------------------------------------------------------

__global__ void 

Hough(int* imgOut, unsigned char* imgIn, int nr, int nc, int Pitch,int rmax, int angle_max,double rr){	

  

	unsigned char * houghImg= (unsigned char *) (((char *) imgIn)+blockIdx.x*Pitch);

	

	for ( int i = threadIdx.x; i < nc; i += (int)blockDim.x ){

	

	int POSX = blockDim.y * blockIdx.y + threadIdx.y;

	int POSY = blockDim.x * blockIdx.x + threadIdx.x;

	

	int y = POSY*nc + POSX;

  

	if(houghImg[i] >= t){	

  

  for(int alpha = 0; alpha < angle_max; alpha++){

  	

                        indice = computeHough (...);

  	imgOut[indice]++;

    

  	}	

  }

	

	}

}

and I call this function like that and the image has a size of 512*512

dim3 bloc = (16,16);

dim3 gril = (16,16);

  	

Hough<<<gril, bloc>>>(...);

In fact, My problem is I use this function but the result Is always diffrent.

But I don’t now why

If someone knows how do that without loop

Please could you explain me

I’m trying to carry out my function without loop like that

Hough(int* imgOut, unsigned char* imgIn ,...){	

	

	unsigned int POSX = blockDim.y * blockIdx.y + threadIdx.y;

	unsigned int POSY = blockDim.x * blockIdx.x + threadIdx.x;

	

	unsigned char * houghImg= (unsigned char *) (((char *) imgIn)+blockIdx.x*Pitch);

	

	int y = POSY*nc + POSX;

	

	imgOut[1]=y;

}

But if I print the value of imgOut[1] is already diffrent , but it must have already the same value ( the last pixel of the image) but it 's not like that

In a 512*512 pixels images the results after some executions is

y = 65024

y = 32256

y = 130560

y = 15072

y = 40448

y = 32256

WHY External Image

please help me External Image

You can not write something like that! All run threads will try to write to that same location 1 which is not allowed. In such situation only one of them will succeed but you don’t know which. Because of that you have different results.

From your first post it was looked like you are asking about link between image coordinates and image stored in Cuda2DArray which would be much easier for you to handle but you are actually trying to access image as 1D array.

From that point it will be easier for understanding if you define onedimensional grid and block size. Imagine that like your image is covered with lines instead of rectangles. So each row could be one line (or more but keep this example simple until you get the picture) Number of rows will be dimension of your grid.

So if grid is defined like (512,1,1) in that case means you have 512 rows

and block defined like (512,1,1) means you will have 512 threads per block and it will represent single row.

Take a brake for the moment, you declared char* ImgIn. means during address calculation you need to include PixelSize in calculation. If your bitmap is RGB 24bits for example then

*ImgIn[0] will be R component of first pixell

*ImgIn[1] will be G component of first pixel

*ImgIn[2] will be B component of first pixel

and so on

In this example your pixel is 3 byte size, if you use alpha then it will be 4 byte size, if your image channels are of type float then size of pixel RGB will be of 12 bytes RGBA will be 16 bytes

Pitch is actually width * PixelSize increased to larger proper value to satisfy alignment requirements. So instead using Width*PixelSize in your calculation you will use Pitch.

now reading pixel from position (x,y) will be:

YourPixelType* pixel = (char*) (ImgIn + yPitch + xsizeof(YourPixelType))

now because your BlockIdx.x is y coordinate of image and ThreadIdx.x is x coordinate

you can write:

int index = BlockIdx.xPitch + ThreadIdx.xsizeof(YourPixelType);

Supposing your output image is of same pixel size and same resolution like input image

ImgOut[index] = ImgIn[index];

will just copy source pixel to output image

Thanks

Can I use that ???
index[1]++

Because in fact it’s a Hough transform implementation and I want count pixels with the same parameter

???

My new function is like that

__global__ void 

Hough(int* imgOut, unsigned char* imgIn, int nr, int nc, int Pitch,int rmax, int angle_max,double rr){	

	

	unsigned int POSX = threadIdx.x;

	unsigned int POSY = blockIdx.x;	

  

	if(POSX< nc  && POSY< nr)

	{

  unsigned int index = POSY*Pitch + POSX*sizeof(unsigned char);

  

  if(imgIn[index] >= t)

  {	

  	for(int alpha = 0; alpha < angle_max; alpha++)

  	{

    indice = computeHough (alpha,rr, POSY,POSX, yc,xc, angle_max, rmax); // function which return a integer

    

    imgOut[indice]++;

  	}

  }

	}

}

And I call it like that

Hough<<<512,512>>>(…) and pitch= Width*sizeof(unsigned char)

Could you say me if it ’ s good because the results not cohrent

you cannot have more than 1 thread writing at the same time to the same memory location. It is running parallel, so you will get undefined results (as you have noticed) So what you normally do is :

  • generate all the information in a much larger matrix (3D) and then afterwards do a reduction (sum) to get a 2D matrix.
    or
  • have each thread calculate the value for 1 output-pixel, since within a thread you can sum all you want.

thanks for all

External Image External Image External Image External Image External Image External Image External Image External Image External Image External Image External Image

I understand why my result was never the same because if two pixels tryed to increment the same value in the same time the calculation was done just for a of these pixels.

So finally as the order of threads is always diffrent the diffrent line found were never the same.

External Image

Now, Index seems to be ok because every thread will calculate different value for it but you didn’t provide code for computeHough. You said it returns an integer indice but that integer must be different for each thread (like Index) if you want to use it as index for writting to OutputImage.

EDIT:

You were faster typing your previous post :)

The diffrent thread can’t write in the same memory space in the same time but can they read the same memory space???

Yes, if you write to memory from different threads you get undefined behaviour (who will win), but reading the same location, every thread reads the same value

Ok thanks

I finally understood the logic of CUDA

It 's running but I have been forced to change the algorithme

If something can give me the algorithme of Hough transform to the line detection in parallel it will be awesome because my implementation is very slow

In fact for each pixel in the image a call a CUDA function which calculate for each angle the distance to the origine like that

for(int column = 1;column < numberColumn; column++)

	{

  for(row = 1; row <numberRow; row ++)

  {

  	if(imgIn[column+(row*numberColumn)]>= threshold)

  	{

    HoughKernel<<<1,384>>>(imgOut,column,row,xc,yc, rmax, anglemax);

  	}

  }

	}

// the CUDA method

/****************************************************************/

__global__ void 

HoughKernel(int *imgOut,int c,int r,int xc,int yc,int rmax,int anglemax)

{

	int numAngle = threadIdx.x + blockIdx.x*blockDim.x;

	int ra;

	double conv = 3.1415926535/180.0;

	

	double angle = (double)((double)numAngle*360.0/(double)anglemax)*(double)conv/2.0;

	

	double sinus   = sin(angle);

	double cosinus = cos(angle);

	

	double rr =((double)yc-(double)r)*sinus+((double)c-(double)xc)*cosinus;

  

	if (rr<0.0) {ra=(int)rr;}

	else {ra=(int)rr+1;}

	

	imgOut[numAngle+((rmax+ra)*360)]++;

}

Please :blink:

I would do it like this:

Hough transform = x,y → r,alpha. So I would let each thread calculate one r,alpha element. If you put the input-image in a 2D-texture, you would just have each thread calculate all the x,y pairs for x = [0 num_x]. Then you can have the texture-unit do an interpolation in the input-image for each of those x,y pairs. (You can even have x go from 0 till num_x in steps smaller than 1).

What you can also do is have each block do one r,alpha element. And have each thread of the block do one x,y pair (or more than one, depends on the x_size of your input image). then you will have to add in shared memory & at the end of the kernel do a reduction to get the total value for that r,alpha element.

External Image

thanks,

Can tou correct me,

I would like to have each block do one pixel (i.e. for an image 512*512 the grid dimension is (512,512) ) and each thread do one (r,alpha) . Like that I can icrement my result table without risk of writing in a same memory space.

I have the same problem but I don’t understand why

When I do that It’s running perfectly

__global__ void 

for(c=1;c<numberColumn;c++)

	{

  for(r=1;r<numberRow;r++)

  {

  	if(imgIn[c+(r*nc)]>=t)

  	{

    HoughKernel<<<1,384>>>(imgOut,c,r,xc,yc,rmax, anglemax);

  	}

  }

	}

HoughKernel(int *imgOut,int c,int r,int xc,int yc,int rmax,int anglemax)

{

	int numAngle = threadIdx.x + blockIdx.x*blockDim.x;

	int ra;

	double conv = 3.1415926535/180.0;

	

	double angle = (double)((double)numAngle*360.0/(double)anglemax)*(double)conv/2.0;

	

	double sinus   = sin(angle);

	double cosinus = cos(angle);

	

	double rr =((double)yc-(double)r)*sinus+((double)c-(double)xc)*cosinus;

  

	if (rr<0.0) {ra=(int)rr;}

	else {ra=(int)rr+1;}

	

	imgOut[numAngle+((rmax+ra)*360)]++;

}

But When I do that, which is the same but I go in my image in the Device

I have the result always change

dim3 dimGrid(512,512,1);

dim3 dimBlock(384,1,1);

  	

Hough<<<dimGrid, dimBlock>>>(Outdata, odata, ih, iw,rmax, 384);

__global__ void 

Hough(int* imgOut, unsigned char* imgIn, int nr, int nc,int rmax, int angle_max){

	double rr = 0.0;

	int POSY = blockIdx.x;

	int POSX = blockIdx.y;

	int yc = nr / 2;

	int xc = nc / 2;

	

	if (imgIn[POSX+POSY*nc]==0xff)

	{

	

  int numAngle = threadIdx.x;

  int ra;

  double conv = 3.1415926535/180.0;

 double angle = (double)((double)numAngle*360.0/(double)angle_max)*(double)conv/2.0;

	

  double sinus   = sin(angle);

  double cosinus = cos(angle);

	

  rr =((double)yc-(double)POSY)*sinus+((double)POSX-(double)xc)*cosinus;

	

  if (rr<0.0) {ra=(int)rr;}

  else {ra=(int)rr+1;}

	

  imgOut[numAngle+((rmax+ra)*360)]++;

	}

}

It 's the same thing but It does not return the same result External Image

this makes no sense unless some block are execute in the same time, but I tought that that block are execute one by one.

External Image

No, it is not the same. In the first case you have one block running at the same time, in the second case 512*512 blocks that will all try to write to the same output-pixel at the same time.

You should have each thread do one output-pixel (so one r,alpha) and loop within the thread over all of the input-image for an easy method.

Or (faster but more tricky) you should have each BLOCK do one output-pixel. Then you can have the threads in a block do the various input-pixels (each will do num_input_pixels/num_threads_per_block) and then within the block calculate the sum by means of a reduction, and then write the result of the reduction to the output-pixel.