Circle Hough Transform using CUDA

I am looking to improve the performance of my Circle Hough Transform algorithm and therefore I have resorted to using CUDA. I’m currently using atomic operations in global memory to carry out the voting which I know can be slow. However, I am unsure how to store the votes and the coordinate locations of those votes in shared memory so that I can copy the results back to global memory.

Has anyone got any suggestions/advice on how best to do this in order to achieve maximum performance for the CHT?

Many thanks in advance everyone for your time.

Is it actually slow? (compared to other programs or your own performance goals?)

Hough Transforms have a lot of similarity to histogramming, which I’ve spent some time implementing recently. I found that global atomic operations on Kepler are so fast that I can’t seem to beat them with more complex approaches.

Not sure if you’re come across any of these, but if not, I’m sure they’ll be helpful.

Not in CUDA, but there is OpenCL code here you might want to take a look at:
http://www.cis.upenn.edu/~seupark/projects/ocl_hough_circle/final_report.pdf

Here is a poster of an implementation on CUDA:
http://on-demand.gputechconf.com/gtc/2012/posters/P0438_ht_poster_gtc2012.pdf

Here is a more generic paper:
http://parse.ele.tue.nl/system/attachments/21/original/Fast%20Hough%20Transform%20on%20GPUs%20-%20Exploration%20of%20Algorithm%20Trade-offs.pdf

seibert - I need the algorithm to be as fast as possible - under 2ms in order to achieve frame rates of 450fps. At the minute, I’m achieving 21ms for a 256x256 image. Do you have any I could use in order to implement my CHT?

vacaloca - Yes I have seen all three of these but non of them seem to apply to circles

To put execution times into perspective, you might want to state which GPU they were achieved on.

Actually, more details would be helpful, because often getting better performance is about taking advantage of special circumstances:

  • What are the dimensions of your accumulator array?
  • What GPU are you using? (as njuffa asked)

Ok, here are some more details. I wish to utilise shared memory for the voting step but am unsure how to store the votes and the coordinates of the votes in shared memory simulataneously. The dimensions of my accumulator array are the same as the image dimensions (256x256 pixels) and the GPU I am using the Geforce GT 620 with compute capability 2.1. The CUDA kernel that I am currently using is shown below:

global void
cudaHough(const int *edgeVals, int * output, int numElements, int radius)
{
int idx = blockDim.x * blockIdx.x + threadIdx.x;

if (idx < numElements)
{
	for(float theta = 0.0; theta < 180.0; ++theta)
	{
		if(edgeVals[idx] > 0)
		{
			int centerX = idx % Constants::SUBIMAGE_WIDTH_HEIGHT + (radius *   std::cos(theta));
			int centerY = ((idx - centerX) / Constants::SUBIMAGE_WIDTH_HEIGHT) + (radius * std::sin(theta));

			if(centerX < Constants::SUBIMAGE_WIDTH_HEIGHT && centerY < Constants::SUBIMAGE_WIDTH_HEIGHT)
			{
				atomicAdd(&output[0] + (centerY * Constants::SUBIMAGE_WIDTH_HEIGHT) + centerX, 1);
			}
		}
	}
}

}

You are using a float in a for loop(theta), and applying the ++ operation as if it were an integer. I am surprised that works at all, but the simple fix would be to start it as an int. Then later cast to float for the use by the sin() and cos() function.

Yeah sure, the only reason I had it as a float was so I could experiment with smaller increments of theta, which increase the accuracy of the CHT but makes it computationally more intensive.

Another thing I can think of is that you’re not using sinf & cosf, see if that helps cut down some of the time, assuming nvcc isn’t smart enough to do it implicitly for you. For that matter, you should be calling sincosf to do both operations at the same time:

http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDA__MATH__SINGLE_g9456ff9df91a3874180d89a94b36fd46.html

This is odd: Theta is presumably an angle. Why is it running from 0 to 180? That seems to suggest it’s in degrees, while the trig functions take their inputs in radians.

Since the angle seems to be incremented in some finite number of fix-sized increments, you might want to look into precomputing the trig function values and storing them in a table. Using __sincosf() for the trig functions may be competitive with a table, so why not try it both ways?

The GT620 is a very low-end device, you might want to look into using a higher performance GPU with sm_3x architecture if you have trouble reaching the desired performance with software optimization alone.

The GT 620 is going to be extremely slow, both due to the small number of CUDA cores and the speed of atomics in compute capability 2.1. It isn’t obvious whether the atomic operations or the transcendental functions are the bottleneck with a GPU this slow. Getting a Kepler card (beyond the code suggestions already mentioned above) will probably be the easiest way to speed this up.

I did a quick test with PyCUDA using an edges array that contains random pattern with 10% of the elements set to 1 (making the above fixes to correctly use radians and sincosf). The time will ultimately depend on the pattern and number of atomic accesses. For this particular data, I get the following time per frame (not counting time to send or receive data):

GT 650M: 6.6 ms
GT 640 (GK208): 4.7 ms
GTX 580: 1.2 ms
GTX 680: 1.1 ms

@seibert: That’s useful comparison data. Did you use sincosf() in your code, or __sincosf()? From what I know, for many signal processing applications the device functions work well enough, and provide a nice performance boost.

Hi guys, many thanks for your suggestions, I definitely think using sincosf() or __sincosf() is the way to go, as is pre-computing these values and storing them in a table. Do you guys know or know anyone who can perhaps help me do the voting in shared memory? I agree my device is very low-end, but I would rather optimise the software first before investing in new hardware. Cheers.

This might not be the best approach. Newer hardware is not only faster, it is different, and behaves differently. Something that is faster in shared memory on Fermi may end up being slower than a “less clever” atomic operation approach on Kepler.

This is true. Every time I have tried using atomics on shared memory on Kepler I have gotten a slower implementation than just using global memory atomics.

I used sincosf(). If I switch to __sincosf():

GTX 580: 1.0 ms
GT 640 (GK208): 4.4 ms
GTX 680: 1.0 ms

I find it curious that the GTX 580 and 680 have the same speed, leading me to believe I’m hitting some other bottleneck in the benchmark.