Need advice with Hough transform implementation howto sum blockwise result in global mem


UPDATE: Code is now available, see a few posts below (in the post with the attached screenshot)

I am currently working on a fast Hough transform implementation for computer vision. This will be used on satellite imagery to detect the typical US urban road layout. Basically my program will look for signs of civilization on services like Google Maps. The goal is to create synthetic night lighting like this from aerial images taken in daylight.…703624/sizes/l/

(this is an actual aerial photography).

Background: A useful application of the Hough transform is to detect straight lines in a contour image, which may for example result from running an edge detection algorithm over a photograph. A Hough transform transforms each pixel (x,y) from the source image into a sine wave in parameter space (rho, theta) representing all lines that may run through the pixel (x,y).

In my implementation, each thread block computes the entire parameter range rho,theta for a specific square from the source image. Computation is performed entirely in floating point. Due to the restricted size of shared memory (only ~4096 floats) I have to do several passes across distinct subsets of theta within each thread.

After each pass I have to accumulate my results in global memory because I must re-use my precious and limited shared memory for a different subset of theta. Since my results are floating point, I cannot use atomic adds. Duh. So what should I do?

If I simply use the += operator on variables in global memory, I fear that two thread blocks accessing the same location in the result memory would collide and run into a race condition, such that results from one or more thread blocks are not added to the total tally.

Do you think it is feasible to implement some kind of inter-block semaphore or mutex mechanism to emulate blockwise atomic adds? Say, if one thread block needs to accumulate its results for the first sub-range of theta in global memory space, then the other blocks either would have to wait before doing the same or they would compute and accumulate results for a different range first, until they can acquire the mutex for the first sub-range.

Has anyone posted readily available macros for mutexes in global memory on these forums yet?

What I would like to avoid is having to dump my results to separate memory spaces for each thread block, like it is done in the 256 histogram sample (when no atomic support is available). In my case there may be many blocks required for a typical 512x512 pixel source image, so this is simply not feasible - to much result memory required!

Can a thread block detect on which SM it is executed? (last time somone said “No”, but I would like to get a few more “Forget that” statements before I belive it). Then on a machine with 12 SMs I would only need 12 distinct result buffers in global memory regardless of the number of thread blocks used in the grid.

Another idea: each executing thread block could atomically increment a global counter. Then I take this counter modulo he number of SMs on the GPU… All thread blocks running simultaneously on the same GPU will then have different counter values at any given time, which I could use as an index into the result buffer. A GPU with 12 SMs would then only need 12 result buffers. These can finally be summated with another kernel call. Is this feasible?

BTW Is anyone interested in my Hough transform implementation when it is done? I hope to get it down to about 60ms for a 512x512 grayscale source image (with a resulting parameter space of about half the source pixel count). Without summating any results in global memory yet, I currently get about 30ms of execution time on a G92 chip (8800GT)


I can tell you again that no, you cannot find out on which SM you are running. If you post your code I can take a look, I have programmed a hough transform in the past (non-cuda).Not so long ago someone else also took a stab at a hough transform in CUDA, you can probably find it on the forum.

A possible way to make this all work better (without needing artificial mutexes and other stuff that is likely to be error-prone) is the following:

Let each block calculate a theta-rho block of the output image. Then you load in the sourcedata coalesced into shared memory in chunks until you are done (for loop). That way each thread only writes to one specific output-location and you don’t have these problems. You could even do the addition in shared memory and only write at the end (will almost certainly be faster).

So you have
shared float imagedata[blocksize];
shared float result[blocksize];

This way you have less memory reads&writes as in your solution (you have less reads & more writes), so you should get higher performance (as you will be memory-bound).

A short calc of the possible performance when you have each thread calculate an output-pixel:

512x512x4 = 1 Mb

Let’s say you make 256x256 output size, then you can have 256 blocks of 256 threads
So you need to read your source-image 256 times.
2565125124 = 256 Mb
You write
256*4 = 256 kb.

When you reach 50 Gb/s mem bandwidth, that comes down to 5 msec minimum kernel time.

Your vote has been counted… once ;) I was looking for some larger sample size, hehe.

There must be a way to label each SM uniquely. Are parts of the shared memory contents persistent across thread blocks that are executed on a particular SM? If so, one could place an ID in shared mem. This persistent ID would only be assigned only once when the first thread block is assigned to the SM. By using a counter in global mem I could keep track how many blocks have been scheduled in total. This allows me to decide whether or not it is the first thread block scheduled on a SM.

How is that going to work efficiently, considering that one pixel from the source image clobbers all over the theta-rho space? The only way I see this working is to make each thread block scan the entire source image. Even when reading from a texture, this consumes an unfortunate amount of memory bandwidth.

But then, in my solution each thread block will also have to read and write the entire result image. Not sure if this is better or worse compared to your solution. Your solution would have to compute the sine/cosine a lot more than mine, that’s for sure. But anyway, we’re bandwidth limited here - not compute limited.

I don’t think I will change my fundamental design now, considering that I’ve got most of the code done (it works in Emu mode) - except for the final accumulation of block results in global mem. So I will have to find some solution to the originally stated problem. But thanks for your input. If I had to re-do it from scratch I would probably try your design.

I was just able to get the number of blocks in the grid down to 11 (a matter of changing a few #defines). Each source block now handles a vertical stripe of 48x512 pixels from the source image and occupies 384 threads.

With this approach I will require much less reads/writes to and from the result buffers because most accumulation happens in shared memory. However I will stress (and probably blow) my texture cache when reading the input from the 2D texture.

So I am getting closer to a workable solution. 11 blocks per grid means that I may be using the same method as used in the histogram256 sample for final accumulation.

However 11 blocks per grid would not even utilize the GPU fully, so I may have to get that number up to at least the number of SM’s on the GPU.


That doesn’t change the fact it is true ;)

In general it is always wise to let your grid represent output-space so you do not have problems adding things up.

Well, I did the calculation for you that it would take around 5 msec for your kernel this way. That looks much better than the 60 msec you think you will have.

This is my first kernel I built from scratch (so far I’ve only tinkered with other people’s kernels)

I love to learn from my mistakes, so please let me make some. ;-)


Here’s my code - it was originally based on the simpleTexture SDK sample. Comes with VS 2005 Express Solution and Project files. Read the last posts in this thread for more notes on usage and performance.

Surprisingly this runs in quick 67ms on an nVidia 9600GSO using 512x512 input data and a 512x512 pixel result array. This requires 2 separate kernel passes, the second pass is needed to accumulate the results, similarly to the 256 histogram SDK sample. It currently uses 11 thread blocks to process the full 512x512 input image so it will only execute in full speed on GPUs that have 12 SMs or more (e.g. 8800GS, 9600GSO, 8800GT etc…).

Right now we use just 384 because the registers use per thread does not allow more threads. I wanted to get the register count down so that I can use 512 threads per block. However I fear that this might not be possible without going to assembly code.

The result looks decent however I have not yet fully verified mathematical correctness. There is no “gold” CPU kernel for verification yet. I am also not sure if my texture lookups hit the centers of the pixels or the corners.

Results are saved as PGM files. I recommend the Cognaxon WSQ viewer to look at these. It is a lightweight viewer that works on PGM and PPM files.

UPDATE1: added some bounds checking for safety, sacrificing speed. I think there is a race condition left, resulting in some pixels not getting counted. The output from Emu Mode still looks brighter in some places.

UPDATE2: now allowing unlimited resolutions for the result data. However higher resolutions require more computation passes. I tried generating a 1024x1024 pixel result for a change - took 180ms, looks nice!

UPDATE3: tried to resolve the collisions during writes to shared memory. This allowed me to get rid of a syncthreads() call and gain more speed. However Emu and Device mode still show differences. WHY?

UPDATE4: Finally figured out the race condition. Now Emu and Device show the correct results.

Enjoy. I am sure this could be done even faster.


Hugh_out.jpg (13.6 KB)

Oh, offcourse :) Making mistakes is my core business. :P

But I hope my message was clear: It is very often smart to have your calculation-grid as a representation of your outcome matrix/array. Quite often when you have it represent your input-space, you have to :

  • do some freaky (slow) atomic operation

  • use N times as much memory and perform a reduction afterwards (this can btw. still be the fastest way to do these kinds of things, if you have enough memory for it)

As another note : it can be faster to use 256 threads per block (depends on register usage also) A MP can hold up to 768 threads at the same time. When using blocks of 256 threads, you can have 3 blocks per MP, or 2 blocks for 512 threads (when it will fit) When using 384 threads, you can have 2 blocks, but only when using very few registers.

Yes, I will try to keep that in mind.

In my case, one thread block consumes 16000 bytes of shared memory, leaving no opportunity for a second thread block to be scheduled on an SM. Oops.

I am still pondering over some problem with the transform. Apparently there is some kind of race in my accumulation in shared memory, resulting in different (i.e. correct) output only when using Emu mode.


Ooops!!! __sinf and __cosf are only accurate within [-pi, pi]. This was fixed pronto…

I also found a problem with several threads writing to the same shared memory location. I had to work around it by producing non-divergent branches within a warp. Ouch.

The result look entirely correct now. I think I can now try this transform on real-world source data.

Still the transform runs in reasonable time producing 512x362 pixel output from a 512x512 source image. This takes some 37ms here on a nVidia 9600GSO. Expect slower execution times when more pixels in the source image are non-black pixels.


Vertical axis on the result image is distance from center rho from 0 to half the diagonal of the source image. Horizontal axis on the result image is angle theta from -pi to pi.

You can freely choose the result’s horizontal dimension (angle) and the vertical dimension (distance). The Kernel time will scale linearly with horizontal result size.
Note that vertical dimension above 490 or so causes kernel execution time to suddenly double because multi-pass has to be done.

For source image heights other than 512 you might also replace the VERTICAL_GROUPING define with your actual source image height. I was intending to automate this, but didn’t get around to do it yet.

The overall result data amplitude can be adjusted with the scaling parameter, in case you want to analyze the result numerically. There is also a limit parameter which clamps the values at the specified peak value.

If you want to launch more thread blocks to more fully occupy your GPU (e.g. GTX 260/280) then divide VERTICAL_GROUPING by 2 or 4. This launches more blocks and requires more memory on the GPU, but it does not affect run time much.

When you download and try my code, replace the line in the end


with a


This is a bug I copied from the simpleTexture SDK sample. It results in dialog boxes popping up reporting invalid heap address.

Apparently when using cutLoadPGMf to load PGM files with a NULL pointer for storage, you have to call cutFree on the resulting allocated memory, NOT free.


I am attaching a version of my Hough transform kernel that uses global atomics to conserve memory and to lock regions of the result memory that I am about to update. The atomics are used to implement mutexes, each specific to a distinct region in the result memory.

COMPUTE CAPABILITY 1.1 REQUIRED (meaning the G80 chip is a no-no…)

Previously I’ve tried the atomicFloatAdd macro that someone had kindly posted to these forums, but performance was rather poor (I guess my writes were anything but sparse).

The main advantage of this Hough kernel is the memory savings, similar to what is achieved in the Histogram 256 sample when enabling atomics.

Christian (13.5 KB)