Histogram computation

I have been reading the PDF on the histogram sample, and the approach seems awfully complicated.

What is wrong with just having each thread block (corresponding to an image block) keep a shared memory array of 256 ints representing the 256 bins, where each thread computes its pixel’s index into the local histogram and then increments that bin? I know there may be pixels in a block with the same value, but won’t the automatic serialization of shared memory writes (bank conflicts) deal with any simultaneous writes? Or does it have something to do with an increment involving both a read and then a write, which might not happen atomically?
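For concreteness, what I have in mind is roughly the sketch below (the kernel and buffer names are made up); the marked line in the inner loop is the increment I am asking about.

```cuda
// Rough sketch of the naive per-block histogram I have in mind
// (d_image, d_partialHist etc. are made-up names).
__global__ void naiveHistogram256(const unsigned char *d_image,
                                  unsigned int *d_partialHist,
                                  int pixelsPerBlock)
{
    __shared__ unsigned int s_hist[256];

    // Clear the per-block histogram.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_hist[i] = 0;
    __syncthreads();

    // Each thread walks over part of the block's pixels.
    int base = blockIdx.x * pixelsPerBlock;
    for (int i = threadIdx.x; i < pixelsPerBlock; i += blockDim.x) {
        unsigned char bin = d_image[base + i];
        s_hist[bin]++;   // <-- the increment in question: a separate read,
                         //     add and write, so concurrent increments of
                         //     the same bin can be lost
    }
    __syncthreads();

    // Write the per-block sub-histogram out for a later merge pass.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        d_partialHist[blockIdx.x * 256 + i] = s_hist[i];
}
```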

Your hunches are right: shared memory has no concurrency control among threads and no atomic operations, so there are overwrite hazards. Bank-conflict handling only serializes the individual reads and writes; it does not make the whole read-increment-write sequence atomic.

By the way, G84 and later have atomic operations on global memory, but still not on shared memory. This reduces hardware complexity and cost, but increases the burden on software.
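On that hardware you can build the histogram correctly (if not quickly) by doing every increment with atomicAdd directly in global memory. A minimal sketch, assuming an 8-bit image in d_image and a zero-initialized 256-entry d_hist (made-up names):

```cuda
// Minimal sketch using global-memory atomics (requires compute capability
// 1.1, i.e. G84-class hardware or later). Every increment is one atomic
// read-modify-write, so no counts are lost -- but contended atomics on
// global memory are slow.
__global__ void globalAtomicHistogram256(const unsigned char *d_image,
                                         unsigned int *d_hist,
                                         int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < numPixels; i += stride)
        atomicAdd(&d_hist[d_image[i]], 1u);
}
```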

Hi, I also have a question regarding the histogram256 computation: is it somehow possible to switch off the warp scheduling and force the execution order of individual warps to be strictly serial, so that you can use per-block sub-histograms instead of per-warp histograms? Thanks for any ideas…

Yes, by making the block size 32 threads :)
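For reference, this is exactly why the sample keeps one sub-histogram per warp: between warps there is nothing to synchronize, and collisions within a warp are resolved with a software tagging trick. A simplified sketch of that trick (roughly what the sample describes, not its literal code):

```cuda
// Simplified sketch of the per-warp tagging trick (roughly what the
// histogram256 sample describes). Each warp owns its own 256-entry
// sub-histogram, so there are no inter-warp races at all. Within a warp,
// the top 5 bits of every counter hold the lane that last wrote it; a lane
// retries until its own tagged value survives, which works because the warp
// executes these instructions in lockstep.
#define TAG_MASK 0x07FFFFFFU

__device__ void addByte(volatile unsigned int *s_warpHist,
                        unsigned int data,       // the 8-bit value to count
                        unsigned int threadTag)  // (threadIdx.x & 31) << 27
{
    unsigned int count;
    do {
        count = s_warpHist[data] & TAG_MASK;  // strip the previous tag
        count = threadTag | (count + 1);      // increment and re-tag
        s_warpHist[data] = count;
    } while (s_warpHist[data] != count);      // lost to another lane? retry
}
```

With a block size of 32 threads there is only one warp, so the per-warp histogram and the per-block histogram are the same thing.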

Thanks for your answer! :)

Next attempt: what I am really after is some kind of implementation of an atomic add between different warps of the same block, because I would like to keep the warp scheduling during global memory accesses…

…a 1-warp-per-block configuration is not efficient enough, and I do not have enough shared memory to store even 2 per-warp histograms.

What can be expected (in detail) from ignoring the undefined behaviour between warps (apart from a wrong result)? Are there any general statements about how big the deviation can be, or how it depends on the input data and its distribution? Thanks.

In detail? The results will be complete rubbish; a variance of infinity, loosely speaking.

I have a kernel that does a “histogram” of sorts: it actually builds a list of all values in each bin (spatial partitioning). For fun one day, I implemented a version of it that ignored the race conditions caused by the lack of atomic increments. It usually built lists with only one or two elements in them, where each list should have had more than 100.
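The racy version was essentially the hypothetical pattern below (names made up): with no atomic increment, many threads read the same old count, write into the same slot, and store back the same new count, so nearly every append is lost.

```cuda
// Hypothetical sketch of the racy "append to bin list" pattern described
// above (d_binOf, d_binCount, d_binList are made-up names).
__global__ void racyBinning(const float *d_values, const int *d_binOf,
                            float *d_binList, int *d_binCount,
                            int maxPerBin, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    int bin  = d_binOf[i];
    int slot = d_binCount[bin];      // many threads see the same old count
    if (slot < maxPerBin)
        d_binList[bin * maxPerBin + slot] = d_values[i];  // same slot reused
    d_binCount[bin] = slot + 1;      // lost updates: the count barely grows

    // On hardware with atomics the fix is
    //     int slot = atomicAdd(&d_binCount[bin], 1);
    // which reserves a unique slot for every thread.
}
```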

Strange… my just-for-fun test resulted in only a small deviation (<5%), but I tried just once. Of course I will have to think of something else. Thanks!

The difference is probably that my test used global memory; the higher latency widens the window between the read and the write, so more updates are lost and the errors are larger.

Depending on how the values you are counting are distributed in the image and on the order in which your warps run (e.g. whether all warps counting value A run concurrently), you can imagine the reported count ending up anywhere from 1 up to the actual count.