CUDA and Hough Transform


I’m a beginner with CUDA and I’m trying to implement the Hough transform.

The problem is that when I call this function several times with the same parameters, the result in imgOut is different every time, and I don’t understand why.

I think the problem comes from CurrentR and CurrentC, but I don’t know why.

Can you explain this to me, please?


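// Hough transform kernel
//	@param imgOut : Hough accumulator (output)
//	@param imgIn  : pointer to the image on the device
//	@param nr     : number of rows
//	@param nc     : number of columns
__global__ void Hough(int* imgOut, unsigned char* imgIn, int nr, int nc, int Pitch, int rmax, int angle_max)
{
	int t = 200;                          // intensity threshold
	int yc, xc;                           // image centre
	yc = (int)(nr/2);
	xc = (int)(nc/2);
	double rr = 0.0;
	int ra;
	double conv = 3.1415926535/180.0;     // degrees to radians
	int CurrentR = 0;
	int CurrentC = 0;

	// One block per image row; Pitch is the row stride in bytes.
	unsigned char* houghImg = (unsigned char*)(((char*)imgIn) + blockIdx.x*Pitch);

	for (int i = threadIdx.x; i < nc; i += blockDim.x)
	{
		if (houghImg[i] >= t)
		{
			// 1-based row of this block. The original 1+(i+blockIdx.x*Pitch)/nc
			// only gives this value when Pitch == nc.
			CurrentR = 1 + blockIdx.x;
			// i < nc, so fmod((double)i, (double)nc) is just i.
			CurrentC = i;

			for (int alpha = 0; alpha < angle_max; alpha++)
			{
				// Floating-point division: alpha*360/angle_max truncated in integer math.
				double theta = (double)alpha * 360.0 / angle_max * conv / 2;
				rr = (double)(yc-CurrentR) * sin(theta) + (double)(CurrentC-xc) * cos(theta);
				// ... (the rest of the loop body is cut off in the post)
			}
		}
	}
}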
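CurrentR and CurrentC are ordinary local variables, so every thread has its own copy and they cannot make two runs differ by themselves. The post cuts off before imgOut is written, so the following is only a guess at the cause: if many threads increment the same accumulator cell with a plain read-modify-write, or if imgOut is not zeroed before each launch, the result changes from run to run. Below is a minimal sketch of the voting step under those assumptions. The names houghVote and runHoughVote, the 0-based row index, the block size of 128 and the accumulator layout acc[rho * angle_max + alpha] are all made up for illustration and are not taken from the original code.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <vector>

// One block per image row, threads stride across the columns (same layout as
// the kernel in the post). Votes go through atomicAdd because threads from any
// block may hit the same accumulator cell at the same time.
__global__ void houghVote(int* acc, const unsigned char* img, int nr, int nc,
                          size_t pitch, int rmax, int angle_max, int threshold)
{
    int row = blockIdx.x;                       // 0-based row (assumption)
    if (row >= nr) return;
    const unsigned char* line = img + row * pitch;
    int yc = nr / 2, xc = nc / 2;

    for (int col = threadIdx.x; col < nc; col += blockDim.x) {
        if (line[col] < threshold) continue;
        for (int alpha = 0; alpha < angle_max; ++alpha) {
            double theta = alpha * 3.1415926535 / angle_max;   // [0, pi)
            double rr = (yc - row) * sin(theta) + (col - xc) * cos(theta);
            int ra = (int)floor(rr + 0.5) + rmax;              // shift to [0, 2*rmax]
            if (ra >= 0 && ra <= 2 * rmax)
                atomicAdd(&acc[ra * angle_max + alpha], 1);
        }
    }
}

// Hypothetical host wrapper: uploads a tightly packed nr x nc image, zeroes the
// accumulator, runs the kernel and returns the (2*rmax+1) x angle_max votes.
std::vector<int> runHoughVote(const std::vector<unsigned char>& img,
                              int nr, int nc, int angle_max)
{
    int rmax = (int)ceil(0.5 * sqrt((double)nr * nr + (double)nc * nc));
    size_t accCount = (size_t)(2 * rmax + 1) * angle_max;

    unsigned char* dImg = nullptr;
    int* dAcc = nullptr;
    size_t pitch = 0;
    cudaMallocPitch((void**)&dImg, &pitch, nc, nr);
    cudaMemcpy2D(dImg, pitch, img.data(), nc, nc, nr, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&dAcc, accCount * sizeof(int));
    cudaMemset(dAcc, 0, accCount * sizeof(int));   // every run starts from zero

    houghVote<<<nr, 128>>>(dAcc, dImg, nr, nc, pitch, rmax, angle_max, 200);

    std::vector<int> acc(accCount);
    cudaMemcpy(acc.data(), dAcc, accCount * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dAcc);
    return acc;
}
```

With the memset and the atomic update in place, two calls on the same image should return identical accumulators. Error checking on the CUDA calls is left out to keep the sketch short.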















Can someone tell me how to implement the Hough transform with CUDA?

What I would do is let each thread take care of a pixel in the output image. So it then has to go look in the input-image for all the values to add up. I would access the input image by means of a 2D-texture.

So let’s say you take a blocksize of 16*16, then your grid becomes (output_x_size/16), (output_y_size/16).
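A rough sketch of the one-thread-per-output-pixel idea, assuming the output is a (rho, theta) accumulator with the origin at the image corner. Each thread walks along its own line through the input and counts the pixels above a threshold, so no two threads write the same cell and no atomics are needed. To keep it short it reads the input from plain global memory rather than through the 2D texture suggested above. The names houghGather and runHoughGather, the threshold parameter and the accumulator layout are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <vector>

// One thread per accumulator cell (rhoIdx, thetaIdx). The thread samples the
// line x*cos(theta) + y*sin(theta) = rho once per unit step, so on diagonal
// lines a pixel may occasionally be counted twice or skipped.
__global__ void houghGather(const unsigned char* img, int width, int height,
                            int* acc, int numRho, int numTheta, int threshold)
{
    int thetaIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int rhoIdx   = blockIdx.y * blockDim.y + threadIdx.y;
    if (thetaIdx >= numTheta || rhoIdx >= numRho) return;

    float theta = thetaIdx * 3.14159265f / numTheta;   // [0, pi)
    float rho   = (float)(rhoIdx - numRho / 2);        // signed distance to origin
    float c = cosf(theta), s = sinf(theta);

    int halfLen = numRho / 2;      // image diagonal: long enough to cross the image
    int votes = 0;
    for (int k = -halfLen; k <= halfLen; ++k) {
        // Point at offset k along the line, starting from its foot point.
        int x = (int)floorf(rho * c - k * s + 0.5f);
        int y = (int)floorf(rho * s + k * c + 0.5f);
        if (x >= 0 && x < width && y >= 0 && y < height &&
            img[y * width + x] >= threshold)
            ++votes;
    }
    acc[rhoIdx * numTheta + thetaIdx] = votes;   // written by exactly one thread
}

// Hypothetical host wrapper using the 16x16 block and matching grid from the post.
std::vector<int> runHoughGather(const std::vector<unsigned char>& img,
                                int width, int height, int numTheta,
                                int threshold, int* numRhoOut)
{
    int diag = (int)ceil(sqrt((double)width * width + (double)height * height));
    int numRho = 2 * diag + 1;     // rho in [-diag, diag]
    size_t accCount = (size_t)numRho * numTheta;

    unsigned char* dImg = nullptr;
    int* dAcc = nullptr;
    cudaMalloc((void**)&dImg, img.size());
    cudaMalloc((void**)&dAcc, accCount * sizeof(int));
    cudaMemcpy(dImg, img.data(), img.size(), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((numTheta + 15) / 16, (numRho + 15) / 16);
    houghGather<<<grid, block>>>(dImg, width, height, dAcc, numRho, numTheta, threshold);

    std::vector<int> acc(accCount);
    cudaMemcpy(acc.data(), dAcc, accCount * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dAcc);
    *numRhoOut = numRho;
    return acc;
}
```

The cost is that each thread touches up to a full diagonal of input pixels, which is why cached reads (a texture, for example) matter for this layout.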

Won’t this mean reading each pixel of the input image a thousand times and cause a horrible bandwidth stall?

I wrote a version that works on blocks of input pixels, coordinated with the Rho values. Say you have 180 possible Rho values: you run a 1x180 thread block with a grid of (imageWidth/180, imageHeight). That way I read each pixel just once, move it into shared memory, and then work from shared memory, with each thread updating the Hough plane for its designated Rho value. This gives a measly 2x speedup over a CPU OpenMP version with the same Rho and Theta definitions. It is slow because I’m accessing the output Hough plane in global memory for every Rho/pixel pair, which is both horrible and incurable. I could have put the whole Hough image in shared memory and accumulated there, but that would require far too much shared memory. I don’t really know what to try next; maybe build a sparse output image.
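For reference, here is roughly what that layout could look like. This is a sketch under assumptions, not the poster's code: I am reading "one thread per Rho value" as one thread per angle bin (180 of them), with rho computed per pixel, and the names houghTile, runHoughTile and NUM_THETA are invented. Each block stages 180 pixels of one row in shared memory, then every thread loops over the staged pixels and votes into the global accumulator.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <vector>

#define NUM_THETA 180   // threads per block, one per angle bin (assumed reading)

// Grid: (ceil(width / NUM_THETA), height). Every input pixel is read from
// global memory once; the accumulator is still updated in global memory.
__global__ void houghTile(const unsigned char* img, int width, int height,
                          int* acc, int rhoOffset, int threshold)
{
    __shared__ unsigned char tile[NUM_THETA];
    int t  = threadIdx.x;
    int x0 = blockIdx.x * NUM_THETA;   // first column of this tile
    int y  = blockIdx.y;               // grid.y == height, so y is in range

    tile[t] = (x0 + t < width) ? img[y * width + x0 + t] : 0;
    __syncthreads();

    float theta = t * 3.14159265f / NUM_THETA;   // [0, pi)
    float c = cosf(theta), s = sinf(theta);
    int n = min(NUM_THETA, width - x0);          // valid pixels in this tile

    for (int i = 0; i < n; ++i) {
        if (tile[i] < threshold) continue;       // all threads read the same pixel
        int rho = (int)floorf((x0 + i) * c + y * s + 0.5f) + rhoOffset;
        // Threads in one block hit different angle bins, but other blocks can
        // hit the same cell, so the update has to be atomic.
        atomicAdd(&acc[rho * NUM_THETA + t], 1);
    }
}

// Hypothetical host wrapper; returns a (2*diag+1) x NUM_THETA accumulator.
std::vector<int> runHoughTile(const std::vector<unsigned char>& img,
                              int width, int height, int threshold,
                              int* rhoOffsetOut)
{
    int diag = (int)ceil(sqrt((double)width * width + (double)height * height));
    size_t accCount = (size_t)(2 * diag + 1) * NUM_THETA;

    unsigned char* dImg = nullptr;
    int* dAcc = nullptr;
    cudaMalloc((void**)&dImg, img.size());
    cudaMalloc((void**)&dAcc, accCount * sizeof(int));
    cudaMemcpy(dImg, img.data(), img.size(), cudaMemcpyHostToDevice);
    cudaMemset(dAcc, 0, accCount * sizeof(int));

    dim3 grid((width + NUM_THETA - 1) / NUM_THETA, height);
    houghTile<<<grid, NUM_THETA>>>(dImg, width, height, dAcc, diag, threshold);

    std::vector<int> acc(accCount);
    cudaMemcpy(acc.data(), dAcc, accCount * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dAcc);
    *rhoOffsetOut = diag;
    return acc;
}
```

The sketch makes the bottleneck described above visible: the input side is cheap, but every (pixel, angle) pair above the threshold still costs one atomic update in global memory.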

I looked for any implementation code but found nothing. There’s an OpenVidia GPGPU implementation which I don’t really get, and there’s a cuvi lib with a Hough implementation, but I’m getting no response and there are no benchmarks yet. I also hear the next NPP will contain a Hough implementation. I would really rather write it myself so I keep control, as I need a somewhat specialized version (a, b instead of Rho, Theta). Let me know if there’s anything out there?