CUDA and Image Processing

If I understand correctly:

GridSize = (angleMax, Rmax)

But what should BlockSize be?

And "reduction"? What does that mean? :blink:

Check the reduction example; then you will also understand why it does not work in the second case.

I have a new problem.

I would like to use a gridSize(384,362) and a blockSize(512,512), but it does not run.

I get a cudaError at memory location 0x0012eb08… The error does not come from the contents of the function, only from the call itself.

Indeed, whether or not the function contains any code, the result is this error.

You can have no more than 512 threads per block (and then only when using <10 registers per thread). A blockSize(512,512) asks for 512*512 = 262144 threads. It is always smart to put a CUT_CHECK_ERROR after your kernel call.
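To make the limit concrete, here is a minimal sketch. It assumes the current CUDA runtime API rather than the cutil macros from the SDK, and emptyKernel and launchStatus are hypothetical names of mine. It launches an empty kernel and returns the launch status. A 512x512 block should be rejected before the kernel body ever runs, which would match the observation that the error appears even when the function is empty. Newer GPUs allow 1024 threads per block, which is still far below 262144.

```cuda
#include <cuda_runtime.h>

// Empty kernel: only the launch configuration is being tested.
__global__ void emptyKernel() {}

// Hypothetical helper (not part of CUDA): launches emptyKernel with the given
// shapes and returns the resulting status.
cudaError_t launchStatus(dim3 grid, dim3 block)
{
    cudaGetLastError();                      // clear any earlier error
    emptyKernel<<<grid, block>>>();
    cudaError_t err = cudaGetLastError();    // reports an invalid launch configuration
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();       // reports errors raised while the kernel runs
    return err;
}
```

For example, launchStatus(dim3(384,362), dim3(512,512)) should return an error (cudaGetErrorString turns it into text), while the same grid with dim3(512,1) should launch.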

I’m trying to implement your solution, which uses reduction (I am working from the reduction example).

__global__ void
HoughReduction(int* imgOut, unsigned char* imgIn, int nr, int nc, int rmax, int angle_max)
{
    extern __shared__ int sdata[];
    int POSX, POSY;
    int xc = nc/2;
    int yc = nr/2;
    int ra;
    unsigned int tid = threadIdx.x;
    double rr;
    double conv = 3.1415926535/180.0;

    POSY = threadIdx.x; // row
    for (int j = 0; j < nc; j++)
    {
        POSX = j; // column
        // perform the computation
        if (imgIn[threadIdx.x*nc + j] == 0xff)
        {
            int numAngle = blockIdx.x;
            int ref = blockIdx.y;

            double angle = (double)((double)numAngle*360.0/(double)angle_max)*(double)conv/2.0;

            double sinus   = sin(angle);
            double cosinus = cos(angle);

            rr = ((double)yc-(double)POSY)*sinus+((double)POSX-(double)xc)*cosinus;

            if (rr<0.0) {ra=(int)rr;}
            else {ra=(int)rr+1;}

            if (ra == ref)
            {
                sdata[threadIdx.x*nc + j]++; // results stored in shared data
            }
        }
    }

    // reduction
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0)
        imgOut[blockIdx.x + (rmax+blockIdx.y)*360] = sdata[0];
}

Explanation:

For each row of the input image, I loop over every column of that row.

For each (row, column) pair I compute r for the angle given by blockIdx.x (because I use a grid of (alphaMax, Rmax)) and compare the result with blockIdx.y.

If the two values are equal, I increment the corresponding entry in shared memory.

After that I run a reduction over the shared data and write the output pixel for the pair (blockIdx.x, blockIdx.y), i.e. (alpha, r).

But this method does not work…

Please help me, I’m really a newbie.

I have found the problem, but I don’t understand it.

The problem arises when I try to modify the shared data or the output pixel.

For example

sdata[threadIdx.x*nc + j]++;

or

imgOut[blockIdx.x + (rmax+blockIdx.y)*360] = sdata[0];

are not possible, but I don’t know why.

It crashes? Then you are probably writing past the end of an array.
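One place where this can happen in the kernel posted above: sdata is indexed with threadIdx.x*nc + j, so it would need one int per pixel. For a 512*512 image that is 1 MB, far more than shared memory can hold, and the entries are also never set to zero before being incremented. Below is a minimal sketch of the usual pattern, assuming one thread per image row and a power-of-two block size (countEdgePixels, countOnDevice and ROWS are names I made up). Each thread keeps a private counter, stores it in its own slot sdata[threadIdx.x], and the block then sums the slots.

```cuda
#include <cuda_runtime.h>

#define ROWS 512   // threads per block, one per image row (assumed power of two)

// Counts the pixels above `threshold` in a ROWS x nc image using a single block.
__global__ void countEdgePixels(int* total, const unsigned char* img, int nc,
                                unsigned char threshold)
{
    __shared__ int sdata[ROWS];            // exactly one slot per thread

    int count = 0;                         // private per-thread counter
    for (int j = 0; j < nc; j++)
        if (img[threadIdx.x * nc + j] > threshold)
            count++;

    sdata[threadIdx.x] = count;            // index is always below ROWS
    __syncthreads();

    // Tree reduction: each pass halves the number of active threads.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        *total = sdata[0];
}

// Hypothetical host wrapper: copies the image over, runs the kernel, returns the count.
int countOnDevice(const unsigned char* hostImg, int nc, unsigned char threshold)
{
    unsigned char* dImg;
    int* dTotal;
    int total = -1;
    cudaMalloc(&dImg, (size_t)ROWS * nc);
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemcpy(dImg, hostImg, (size_t)ROWS * nc, cudaMemcpyHostToDevice);
    countEdgePixels<<<1, ROWS>>>(dTotal, dImg, nc, threshold);
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dTotal);
    return total;
}
```

The shared array here costs 512*sizeof(int) = 2048 bytes per block regardless of the image width.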

I don’t understand something.

I call my function like this:

dim3 dimGrid(384,362);
dim3 dimBlock(512,1);
int sizeShared=512*sizeof(int);

HoughReduction<<<dimGrid, dimBlock, sizeShared>>>(...)

And in this function I run this loop:

__shared__ sdata[512];
int value =0;
for (int j = 0; j < nc; j++)
{
   ........
   if ()
   {value++;}
}
sdata[threadIdx.x]=value;

The application crashes when the function executes sdata[threadIdx.x]=value; but I don’t know why, because if I increment “value” in the loop but outside the condition (if), there is no problem.

  1. Try declaring data as

__shared__ int data[512]

You omitted the type! If the final value can be greater than 255, the type should be short, and if it can be greater than 65535, the type should be int.

  2. Compile with the -Xptxas -v switches to see memory and register usage. You need 512*sizeof(int)=2048 bytes of shared memory per block, and the device has only 16K per multiprocessor, which is shared among all blocks resident on that multiprocessor.

Thanks.

Sorry, that was a copy/paste mistake; in my code the declaration is correct.

So in fact the problem appears when I do something inside a condition like

if (imgIn[POSY*nc + j] > 0)
{ value++; }

I then have a problem using “value” afterwards, but if I modify “value” outside the condition my function works.

So I don’t understand why “value” becomes inaccessible.

It’s not consistent.

Check whether your value variable is stored in memory or in a register. Compile with the -keep switch and look at the PTX assembly listing. Also, as I said earlier, use the -Xptxas -v switches when compiling to see the resources used by the kernel. Maybe you are running out of registers.

Thanks.

I would like to know how to interpret the output of compiling with -v.

You can enter the shared memory and register figures into the occupancy calculator (Excel sheet) to see how many threads per block to use for optimal occupancy. That way you can also check whether you are requesting too many threads per block.
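The same figures can also be read at run time. This is a minimal sketch, assuming a CUDA runtime that provides cudaFuncGetAttributes. sampleKernel, queryResources and printResources are made-up names, and sampleKernel only stands in for the real kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel with 512 ints of static shared memory.
__global__ void sampleKernel(int* out)
{
    __shared__ int sdata[512];
    sdata[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[blockDim.x - 1];
}

// Fills `attr` with the compiled kernel's resource usage and returns the CUDA status.
cudaError_t queryResources(cudaFuncAttributes* attr)
{
    return cudaFuncGetAttributes(attr, sampleKernel);
}

// Prints the figures that the occupancy calculator asks for.
void printResources()
{
    cudaFuncAttributes attr;
    if (queryResources(&attr) != cudaSuccess)
        return;
    printf("registers per thread : %d\n", attr.numRegs);
    printf("static shared memory : %zu bytes per block\n", attr.sharedSizeBytes);
    printf("local memory         : %zu bytes per thread\n", attr.localSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}
```

If maxThreadsPerBlock comes back smaller than the block size you launch with, the kernel needs too many registers per thread for that block size.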

I don’t understand how to find out whether my variable “value” is in the right memory space :blink:

And there is no reason for the location of “value” to change.

for (int j = 0; j < nc; j++)
{
    value++;
}

This code works,

for (int j = 0; j < nc; j++)
{
    if (imgIn[POSY*nc + j] > 0)
    {
        value++;
    }
}

but this code does not.

It’s very strange.

You need the -keep switch, which leaves the intermediate compilation files on disk. Then open the file with the .ptx extension, find your function in it, and check whether it increments a register or a memory location.

Also, you should avoid posting partial code, because the bug may be in the part that isn’t posted. For example, if your imgIn is declared as int*, then dereferencing it can yield negative values too, which you probably don’t expect in your condition imgIn[POSY*nc + j] > 0.

In fact imgIn is declared as unsigned char. I run a Sobel filter on imgIn, so after filtering each value in imgIn is 0 or 0xff (255), and imgIn > 0 tells me whether a pixel is an edge pixel or not.

Could you please tell me how to change the compiler options in Visual Studio 2005 (v8.0)?

I have found the problem:

__device__ int
computeHoughReduction (int alpha, int r, int POSY, int yc, int xc, int angle_max, int nc, int rmax, double* sinDevice, double* cosDevice, unsigned char *imgIn){
    int ra = 0;
    double rr = 0.0;
    int test = 0;
    unsigned char seuil = (unsigned char)200.0; // threshold
    double sinus = sinDevice[alpha];
    double cosinus = cosDevice[alpha];
    int index = 0;

    for (int j = 0; j < nc; j++)
    {
        index = j+POSY*nc;
        if (imgIn[index] > seuil)
        {
            rr = ((double)yc-(double)POSY)*sinus+((double)j-(double)xc)*cosinus;
            ra = (int)ceil(rr);
            if (ra == r)
            {
                test++;
            }
        }
    }
    return (test);
}

I think the problem is that imgIn[index] can’t be read,

but I don’t know why, because I have checked that I can access imgIn[512*512-1] and it works.

It is the same for imgIn[0], so index = j+POSY*nc can’t be greater than 512*512-1, because nc = 512, POSY = threadIdx.x and blockSize = 512 (i.e. threadIdx.x is in [0;511]).

If someone understands what is going on, please help me… :biggrin:

This is the latest version of my kernel. If someone sees an error, please help me.

I’m working with images of 512*512 pixels and I call my kernel like this:

int* Outdata; // Hough space in device memory
unsigned char *odata; // 512*512-pixel image after Sobel filtering, in device memory
int iw; // width
int ih; // height
int rmax;
int angleMax;
double *sinDevice, *cosDevice; // sin and cos tables in device memory

dim3 dimGrid(384,362);
dim3 dimBlock(512,1);
int sizeShared=512*sizeof(int);

HoughReduction<<<dimGrid, dimBlock, sizeShared>>>(Outdata, odata, ih, iw, rmax, angleMax, sinDevice, cosDevice);
CUT_CHECK_ERROR("Hough Error \n");

I’m desperate.

In computeHoughReduction there is something I don’t understand. I can write imgIn[512] and it works, but when I write imgIn[j] with j in [0;511] it crashes.

__device__ int
computeHoughReduction (int alpha, int r, int POSY, int yc, int xc, int angle_max, int nc, int rmax, double* sinDevice, double* cosDevice, unsigned char *imgIn){
    int test = 0;
    unsigned char seuil = (unsigned char)200.0; // threshold

    for (int j = 0; j < nc; j++)
    {
        if (imgIn[j] > seuil){test++;} // for this test I just count the edge pixels
    }
    return (test);
}

//Hough transform with reduction
/////////////////////////////////////////////////////////////////
__global__ void
HoughReduction(int* imgOut, unsigned char* imgIn, int nr, int nc, int rmax, int angle_max, double* sinDevice, double* cosDevice)
{
    int size = angle_max*rmax;

    int POSX,POSY;
    int xc = nc/2;
    int yc = nr/2;
    int ra =0;
    double rr = 0.0;
    unsigned int tid = threadIdx.x;
    int value = 0;

    double conv = 3.1415926535/180.0;
    unsigned char *intermediaire = (unsigned char *) (((char *) imgIn)+threadIdx.x*nc);

    POSY = threadIdx.x; // row

    value = computeHoughReduction(blockIdx.x, blockIdx.y, threadIdx.x, yc, xc, angle_max, nc, rmax, sinDevice, cosDevice, intermediaire);

    __shared__ int sdata[512];
    sdata[threadIdx.x] = value;

    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0)
        imgOut[blockIdx.x +(rmax + blockIdx.y)*360]=sdata[0];
}

Please help.

Huh,

you defined 139008 (!) blocks, each containing 512 threads, for a total of 71172096 threads. Because threadIdx.x selects the image row and every block has its own set of 512 threads, each row is read 139008 times. In your test function, where you scan a row for bytes with a value larger than 200, every block ends up processing the complete image (all 512 rows), because blockIdx is not used in the calculation.
You tried to use the reduction the way Mark Harris does it, but it must be modified to work properly with a 2D grid. The PDF in the reduction example assumes you are using a 1D grid.
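For what it is worth, the reduction inside a block does not depend on the grid shape; what changes with a 2D grid is where each block writes its result. The posted kernel writes to blockIdx.x + (rmax+blockIdx.y)*360, which uses a row stride of 360 although the grid is 384 wide, and which can run past the end of the output if that buffer holds angle_max*rmax ints (an assumption on my part, the allocation is not shown). The sketch below derives the index from gridDim instead. It assumes angles spanning 180 degrees, a distance bin offset of gridDim.y/2, float arithmetic, and an image with at most 512 rows. houghCell, houghOnDevice and THREADS are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define THREADS 512  // one thread per image row; must be a power of two

// One block per accumulator cell: blockIdx.x selects the angle, blockIdx.y the
// distance bin. Bin b collects pixels with ceil(rho) == b - gridDim.y / 2.
__global__ void houghCell(int* acc, const unsigned char* img, int nr, int nc,
                          unsigned char threshold)
{
    __shared__ int sdata[THREADS];

    float theta = blockIdx.x * 3.14159265f / gridDim.x;
    float sinT = sinf(theta), cosT = cosf(theta);
    int wanted = (int)blockIdx.y - (int)gridDim.y / 2;
    int row = threadIdx.x;

    int votes = 0;                                   // private per-thread counter
    if (row < nr) {
        for (int j = 0; j < nc; j++) {
            if (img[row * nc + j] > threshold) {
                float rho = (nr / 2 - row) * sinT + (j - nc / 2) * cosT;
                if ((int)ceilf(rho) == wanted)
                    votes++;
            }
        }
    }
    sdata[threadIdx.x] = votes;
    __syncthreads();

    // Block-level sum of the per-row votes.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }

    // Row-major index built from the grid size, so it stays inside a buffer
    // of gridDim.x * gridDim.y ints.
    if (threadIdx.x == 0)
        acc[blockIdx.y * gridDim.x + blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper: runs houghCell and copies the accumulator back.
void houghOnDevice(int* hostAcc, const unsigned char* hostImg, int nr, int nc,
                   int numAngles, int numR, unsigned char threshold)
{
    unsigned char* dImg;
    int* dAcc;
    size_t accBytes = (size_t)numAngles * numR * sizeof(int);
    cudaMalloc(&dImg, (size_t)nr * nc);
    cudaMalloc(&dAcc, accBytes);
    cudaMemcpy(dImg, hostImg, (size_t)nr * nc, cudaMemcpyHostToDevice);
    houghCell<<<dim3(numAngles, numR), THREADS>>>(dAcc, dImg, nr, nc, threshold);
    cudaMemcpy(hostAcc, dAcc, accBytes, cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dAcc);
}
```

This still rereads the whole image once per block, so with a (384,362) grid it does the work described above; it only shows where the 2D indices go.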


Thanks.

That is the algorithm I’m trying to implement.

This is what I’m doing:

each BLOCK computes one output pixel => I need 384 angles and 362 values of r, so I must have a grid of (384,362).

I think I’m implementing this algorithm correctly.

I’m using an intermediate unsigned char * to restrict my calculation to a specific row defined by threadIdx.x.

But I thought the reduction was done within a block, and the block is 1D.

If you know a more efficient algorithm, can you explain it to me?
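One commonly used alternative, sketched here under assumptions rather than taken from this thread, is to turn the loop around: one thread per pixel, and each edge pixel adds a vote to the accumulator for every angle. The image is then read once instead of once per output cell. The sketch assumes a GPU with integer atomicAdd on global memory, the same angle range and bin offset conventions as a 180-degree Hough transform with bins centred on numR/2, and an accumulator of numAngles*numR ints. houghVote and houghVoteOnDevice are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One thread per pixel. Each pixel above `threshold` casts one vote per angle.
__global__ void houghVote(int* acc, const unsigned char* img, int nr, int nc,
                          int numAngles, int numR, unsigned char threshold)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= nr || col >= nc) return;            // outside the image
    if (img[row * nc + col] <= threshold) return;  // not an edge pixel

    for (int a = 0; a < numAngles; a++) {
        float theta = a * 3.14159265f / numAngles;
        float rho = (nr / 2 - row) * sinf(theta) + (col - nc / 2) * cosf(theta);
        int bin = (int)ceilf(rho) + numR / 2;
        if (bin >= 0 && bin < numR)                // never write outside acc
            atomicAdd(&acc[bin * numAngles + a], 1);
    }
}

// Hypothetical host wrapper: zeroes the accumulator, votes, copies the result back.
void houghVoteOnDevice(int* hostAcc, const unsigned char* hostImg, int nr, int nc,
                       int numAngles, int numR, unsigned char threshold)
{
    unsigned char* dImg;
    int* dAcc;
    size_t accBytes = (size_t)numAngles * numR * sizeof(int);
    cudaMalloc(&dImg, (size_t)nr * nc);
    cudaMalloc(&dAcc, accBytes);
    cudaMemcpy(dImg, hostImg, (size_t)nr * nc, cudaMemcpyHostToDevice);
    cudaMemset(dAcc, 0, accBytes);

    dim3 block(16, 16);                            // 256 threads per block
    dim3 grid((nc + block.x - 1) / block.x, (nr + block.y - 1) / block.y);
    houghVote<<<grid, block>>>(dAcc, dImg, nr, nc, numAngles, numR, threshold);

    cudaMemcpy(hostAcc, dAcc, accBytes, cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dAcc);
}
```

The trade-off is contention: many pixels on the same line hit the same accumulator cell, so the atomics serialize there, but most pixels of a Sobel-filtered image are not edges and exit immediately.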

But I think it is not logical, given how I can access imgIn:

imgIn[POSY*nc] => works

imgIn[POSY*nc+511] => works (511 is the last value of j)

imgIn[POSY*nc +j] => crash