UPDATE: Code is now available, see a few posts below (in the post with the attached screenshot)
I am currently working on a fast Hough transform implementation for computer vision. This will be used on satellite imagery to detect the typical US urban road layout. Basically my program will look for signs of civilization on services like Google Maps. The goal is to create synthetic night lighting like this from aerial images taken in daylight.
(This is an actual aerial photograph.)
Background: A useful application of the Hough transform is to detect straight lines in a contour image, which may for example result from running an edge detection algorithm over a photograph. A Hough transform transforms each pixel (x,y) from the source image into a sine wave in parameter space (rho, theta) representing all lines that may run through the pixel (x,y).
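For readers unfamiliar with the voting step, it looks roughly like this (all names here -- NUM_THETA, RHO_BINS, RHO_OFFSET, acc -- are placeholders, not my actual code). Every pixel (x, y) traces out the sinusoid rho(theta) = x*cos(theta) + y*sin(theta) in parameter space, and we bump one accumulator bin per discretized theta:

```cuda
// Sketch of the per-pixel Hough voting step (placeholder names).
#define NUM_THETA  256
#define RHO_BINS   512
#define RHO_OFFSET (RHO_BINS / 2)   // shift so negative rho maps to a valid bin
#define PI         3.14159265f

__device__ void votePixel(float* acc, int x, int y)
{
    for (int t = 0; t < NUM_THETA; ++t) {
        float theta = t * (PI / NUM_THETA);
        // all lines through (x, y): rho = x*cos(theta) + y*sin(theta)
        float rho = x * cosf(theta) + y * sinf(theta);
        int   bin = (int)(rho + 0.5f) + RHO_OFFSET;
        acc[t * RHO_BINS + bin] += 1.0f;   // acc lives in shared memory
    }
}
```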
In my implementation, each thread block computes the entire parameter range (rho, theta) for a specific square tile of the source image. Computation is performed entirely in floating point. Due to the restricted size of shared memory (only ~4096 floats), each block has to make several passes, each covering a distinct subset of theta.
After each pass I have to accumulate my results in global memory, because I must re-use my precious and limited shared memory for the next subset of theta. Since my results are floating point, I cannot use the built-in atomic adds (atomicAdd() only works on integers on this hardware). So what should I do?
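One workaround I am considering (assuming the 8800GT's compute capability 1.1, which provides atomicCAS on 32-bit words in global memory): emulate a floating-point atomic add with a compare-and-swap loop on the bit pattern. A sketch:

```cuda
// Emulated atomic add for floats, built on 32-bit atomicCAS.
// Needs compute capability 1.1+ (global memory atomics on 32-bit
// words), so it should run on a G92 / 8800GT.
__device__ float atomicAddFloat(float* address, float val)
{
    int* address_as_int = (int*)address;
    int old = *address_as_int;
    int assumed;
    do {
        assumed = old;
        // try to swap in (current value + val); retry if another
        // thread changed the word in the meantime
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}
```

Under heavy contention the loop may retry often, but for accumulating partial sums once per pass it should be tolerable.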
If I simply use the += operator on variables in global memory, I fear that two thread blocks accessing the same location in the result memory would collide in a race condition, such that the contributions from one or more thread blocks are lost from the total tally.
Do you think it is feasible to implement some kind of inter-block semaphore or mutex mechanism to emulate blockwise atomic adds? Say, if one thread block needs to accumulate its results for the first sub-range of theta in global memory space, then the other blocks either would have to wait before doing the same or they would compute and accumulate results for a different range first, until they can acquire the mutex for the first sub-range.
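A minimal inter-block spinlock can be built on atomicCAS/atomicExch (again assuming compute capability 1.1+). Caveat: only one thread per block should spin on the lock -- e.g. thread 0, with __syncthreads() around the critical section -- otherwise threads of the same warp can deadlock spinning against each other. A sketch:

```cuda
// Simple spinlock in global memory. One thread per block (e.g.
// threadIdx.x == 0) acquires and releases it; the rest of the block
// waits at __syncthreads().
__device__ int g_mutex = 0;          // 0 = free, 1 = taken

__device__ void lock(int* mutex)
{
    while (atomicCAS(mutex, 0, 1) != 0)
        ;                            // spin until we grab it
}

__device__ void unlock(int* mutex)
{
    atomicExch(mutex, 0);            // release
}
```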
Has anyone posted readily available macros for mutexes in global memory on these forums yet?
What I would like to avoid is dumping my results to a separate memory area per thread block, as is done in the 256-bin histogram sample (when no atomic support is available). A typical 512x512 pixel source image may require many blocks, so this is simply not feasible - too much result memory required!
Can a thread block detect which SM it is executing on? (Last time someone said "No", but I would like to get a few more "Forget that" statements before I believe it.) Then on a machine with 12 SMs I would only need 12 distinct result buffers in global memory, regardless of the number of thread blocks in the grid.
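For what it's worth, PTX does expose a special register %smid that can be read via inline assembly (whether your toolkit version documents or supports it is another matter). But note that even knowing the SM id would not make one-buffer-per-SM safe: several blocks can be resident on the same SM at the same time, so two live blocks could still collide on one buffer. A sketch:

```cuda
// Read the id of the SM this block runs on via the PTX special
// register %smid. Caveat: multiple blocks can be resident on one SM
// simultaneously, so one result buffer per SM is NOT collision-free.
__device__ unsigned int get_smid(void)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
```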
Another idea: each executing thread block could atomically increment a global counter. Then I take this counter modulo the number of SMs on the GPU... All thread blocks running simultaneously on the GPU would then have different counter values at any given time, which I could use as an index into the result buffer. A GPU with 12 SMs would then need only 12 result buffers, which a final kernel call can sum up. Is this feasible?
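One caveat with the modulo idea: blocks finish out of order, so a newly launched block can receive the same slot (counter mod 12) as a block that is still running, and they would collide. A safer variant hands out slots from a small pool: claim a free slot atomically at block start, release it at block end. A sketch (names like NUM_SLOTS and g_slotBusy are placeholders; NUM_SLOTS must be at least the maximum number of concurrently resident blocks):

```cuda
#define NUM_SLOTS 12                  // may need #SMs * blocks-per-SM

__device__ int g_slotBusy[NUM_SLOTS]; // 0 = free, 1 = in use

__global__ void houghKernel(float* resultBuffers, int bufferSize)
{
    __shared__ int slot;
    if (threadIdx.x == 0) {
        // claim the first free slot; running blocks eventually
        // release theirs, so the scan terminates
        slot = -1;
        while (slot < 0)
            for (int s = 0; s < NUM_SLOTS; ++s)
                if (atomicCAS(&g_slotBusy[s], 0, 1) == 0) { slot = s; break; }
    }
    __syncthreads();
    float* myBuffer = resultBuffers + slot * bufferSize;

    // ... accumulate this block's partial results into myBuffer ...

    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();              // make buffer writes visible first
        atomicExch(&g_slotBusy[slot], 0);   // release the slot
    }
}
```

A final kernel then sums the NUM_SLOTS partial buffers into the result.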
BTW, is anyone interested in my Hough transform implementation when it is done? I hope to get it down to about 60 ms for a 512x512 grayscale source image (with a resulting parameter space of about half the source pixel count). Without summing any results in global memory yet, I currently see about 30 ms of execution time on a G92 chip (8800GT).