Circle Hough Transform using CUDA

bazedwards · July 30, 2013, 11:51am

I am looking to improve the performance of my Circle Hough Transform algorithm and therefore I have resorted to using CUDA. I’m currently using atomic operations in global memory to carry out the voting which I know can be slow. However, I am unsure how to store the votes and the coordinate locations of those votes in shared memory so that I can copy the results back to global memory.

Has anyone got any suggestions/advice on how best to do this in order to achieve maximum performance for the CHT?

Many thanks in advance everyone for your time.

seibert · July 30, 2013, 3:41pm

Is it actually slow? (compared to other programs or your own performance goals?)

Hough Transforms have a lot of similarity to histogramming, which I’ve spent some time implementing recently. I found that global atomic operations on Kepler are so fast that I can’t seem to beat them with more complex approaches.

vacaloca · July 30, 2013, 3:44pm

Not sure if you’re come across any of these, but if not, I’m sure they’ll be helpful.

Not in CUDA, but there is OpenCL code here you might want to take a look at:
[url]http://www.cis.upenn.edu/~seupark/projects/ocl_hough_circle/final_report.pdf[/url]

Here is a poster of an implementation on CUDA:
[url]http://on-demand.gputechconf.com/gtc/2012/posters/P0438_ht_poster_gtc2012.pdf[/url]

Here is a more generic paper:
[url]http://parse.ele.tue.nl/system/attachments/21/original/Fast%20Hough%20Transform%20on%20GPUs%20-%20Exploration%20of%20Algorithm%20Trade-offs.pdf[/url]

bazedwards · July 30, 2013, 5:49pm

seibert - I need the algorithm to be as fast as possible - under 2ms in order to achieve frame rates of 450fps. At the minute, I’m achieving 21ms for a 256x256 image. Do you have any I could use in order to implement my CHT?

vacaloca - Yes I have seen all three of these but non of them seem to apply to circles

njuffa · July 30, 2013, 6:29pm

To put execution times into perspective, you might want to state which GPU they were achieved on.

seibert · July 30, 2013, 7:10pm

Actually, more details would be helpful, because often getting better performance is about taking advantage of special circumstances:

What are the dimensions of your accumulator array?
What GPU are you using? (as njuffa asked)

bazedwards · July 30, 2013, 7:35pm

Ok, here are some more details. I wish to utilise shared memory for the voting step but am unsure how to store the votes and the coordinates of the votes in shared memory simulataneously. The dimensions of my accumulator array are the same as the image dimensions (256x256 pixels) and the GPU I am using the Geforce GT 620 with compute capability 2.1. The CUDA kernel that I am currently using is shown below:

global void
cudaHough(const int *edgeVals, int * output, int numElements, int radius)
{
int idx = blockDim.x * blockIdx.x + threadIdx.x;

if (idx < numElements)
{
	for(float theta = 0.0; theta < 180.0; ++theta)
	{
		if(edgeVals[idx] > 0)
		{
			int centerX = idx % Constants::SUBIMAGE_WIDTH_HEIGHT + (radius *   std::cos(theta));
			int centerY = ((idx - centerX) / Constants::SUBIMAGE_WIDTH_HEIGHT) + (radius * std::sin(theta));

			if(centerX < Constants::SUBIMAGE_WIDTH_HEIGHT && centerY < Constants::SUBIMAGE_WIDTH_HEIGHT)
			{
				atomicAdd(&output[0] + (centerY * Constants::SUBIMAGE_WIDTH_HEIGHT) + centerX, 1);
			}
		}
	}
}

}

CudaaduC · July 30, 2013, 7:42pm

bazedwards:

global void
cudaHough(const int *edgeVals, int * output, int numElements, int radius)
{
int idx = blockDim.x * blockIdx.x + threadIdx.x;

if (idx < numElements)
{
	for(float theta = 0.0; theta < 180.0; ++theta)
	{
		if(edgeVals[idx] > 0)
		{
			int centerX = idx % Constants::SUBIMAGE_WIDTH_HEIGHT + (radius *   std::cos(theta));
			int centerY = ((idx - centerX) / Constants::SUBIMAGE_WIDTH_HEIGHT) + (radius * std::sin(theta));

			if(centerX < Constants::SUBIMAGE_WIDTH_HEIGHT && centerY < Constants::SUBIMAGE_WIDTH_HEIGHT)
			{
				atomicAdd(&output[0] + (centerY * Constants::SUBIMAGE_WIDTH_HEIGHT) + centerX, 1);
			}
		}
	}
}

}

You are using a float in a for loop(theta), and applying the ++ operation as if it were an integer. I am surprised that works at all, but the simple fix would be to start it as an int. Then later cast to float for the use by the sin() and cos() function.

bazedwards · July 30, 2013, 7:46pm

Yeah sure, the only reason I had it as a float was so I could experiment with smaller increments of theta, which increase the accuracy of the CHT but makes it computationally more intensive.

vacaloca · July 30, 2013, 9:59pm

Another thing I can think of is that you’re not using sinf & cosf, see if that helps cut down some of the time, assuming nvcc isn’t smart enough to do it implicitly for you. For that matter, you should be calling sincosf to do both operations at the same time:

[url]http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDA__MATH__SINGLE_g9456ff9df91a3874180d89a94b36fd46.html[/url]

njuffa · July 30, 2013, 11:28pm

This is odd: Theta is presumably an angle. Why is it running from 0 to 180? That seems to suggest it’s in degrees, while the trig functions take their inputs in radians.

Since the angle seems to be incremented in some finite number of fix-sized increments, you might want to look into precomputing the trig function values and storing them in a table. Using __sincosf() for the trig functions may be competitive with a table, so why not try it both ways?

The GT620 is a very low-end device, you might want to look into using a higher performance GPU with sm_3x architecture if you have trouble reaching the desired performance with software optimization alone.

seibert · July 31, 2013, 2:10am

The GT 620 is going to be extremely slow, both due to the small number of CUDA cores and the speed of atomics in compute capability 2.1. It isn’t obvious whether the atomic operations or the transcendental functions are the bottleneck with a GPU this slow. Getting a Kepler card (beyond the code suggestions already mentioned above) will probably be the easiest way to speed this up.

I did a quick test with PyCUDA using an edges array that contains random pattern with 10% of the elements set to 1 (making the above fixes to correctly use radians and sincosf). The time will ultimately depend on the pattern and number of atomic accesses. For this particular data, I get the following time per frame (not counting time to send or receive data):

GT 650M: 6.6 ms
GT 640 (GK208): 4.7 ms
GTX 580: 1.2 ms
GTX 680: 1.1 ms

njuffa · July 31, 2013, 2:33am

@seibert: That’s useful comparison data. Did you use sincosf() in your code, or __sincosf()? From what I know, for many signal processing applications the device functions work well enough, and provide a nice performance boost.

bazedwards · July 31, 2013, 7:31am

Hi guys, many thanks for your suggestions, I definitely think using sincosf() or __sincosf() is the way to go, as is pre-computing these values and storing them in a table. Do you guys know or know anyone who can perhaps help me do the voting in shared memory? I agree my device is very low-end, but I would rather optimise the software first before investing in new hardware. Cheers.

Ailleur · July 31, 2013, 4:06pm

This might not be the best approach. Newer hardware is not only faster, it is different, and behaves differently. Something that is faster in shared memory on Fermi may end up being slower than a “less clever” atomic operation approach on Kepler.

seibert · July 31, 2013, 4:09pm

This is true. Every time I have tried using atomics on shared memory on Kepler I have gotten a slower implementation than just using global memory atomics.

seibert · July 31, 2013, 4:12pm

I used sincosf(). If I switch to __sincosf():

GTX 580: 1.0 ms
GT 640 (GK208): 4.4 ms
GTX 680: 1.0 ms

I find it curious that the GTX 580 and 680 have the same speed, leading me to believe I’m hitting some other bottleneck in the benchmark.

Topic		Replies	Views
Worse atomic performance in shared than global memory CUDA Programming and Performance	7	8956	August 3, 2017
CUDA image processing Accelaration tips anyone? CUDA Programming and Performance	20	6063	November 16, 2010
Need advice with Hough transform implementation howto sum blockwise result in global mem CUDA Programming and Performance	12	13643	September 13, 2008
CUDA & Smoothed Particle Hydrodynamics Best approach? CUDA Programming and Performance	19	9975	June 24, 2009
CUDA very slow performance CUDA Programming and Performance	21	16743	March 6, 2020
Wishlist Place your considered suggestions here CUDA Programming and Performance	201	204317	April 13, 2009
Using Shared Memory in CUDA C/C++ Technical Blog	36	1994	October 8, 2020
A faster and more accurate implementation of sincosf() CUDA Programming and Performance	25	9114	August 6, 2017
[Beginner]: CUDA slower than serial implementation fill Operation on entire image CUDA Programming and Performance	18	13519	September 15, 2011
Is this a task in which CUDA could speed up things? CUDA Programming and Performance	11	11765	September 29, 2009

Circle Hough Transform using CUDA

Related topics