Histogram Source Code for CUDA-Enabled GPUs: Fast Histograms with Any Number of Bins

Histogram calculation with an arbitrary number of bins is a difficult problem on GPUs. NVIDIA released a histogram example that supports 256 bins for 8-bit data with the CUDA 1.1 release. However, that program is still very limited.

You can find source code for histogram calculation with any number of bins on my website. It operates on 32-bit floating-point data of any size. The input needs to lie in the range 0 to 1, but you can easily change the code to support any other range if you prefer not to normalize your data first.

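To make the range remark concrete, here is a minimal CPU reference sketch in C++ of how a sample can be mapped to a bin for an arbitrary input range. It is not taken from the posted source: the names `bin_index` and `histogram` and the clamping of out-of-range values are assumptions made for illustration. A GPU kernel would need the same index arithmetic per sample, while the accumulation strategy is where the real work lies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the posted source): map a sample in [lo, hi]
// to a bin index in [0, bins - 1]. Assumes bins > 0 and hi > lo.
// Out-of-range values are clamped to the first or last bin.
inline std::size_t bin_index(float x, float lo, float hi, std::size_t bins) {
    if (!(x > lo)) return 0;         // x <= lo, and also NaN, go to bin 0
    if (x >= hi) return bins - 1;    // the upper bound belongs to the last bin
    float t = (x - lo) / (hi - lo);  // normalize to [0, 1)
    std::size_t b = static_cast<std::size_t>(t * static_cast<float>(bins));
    return b < bins ? b : bins - 1;  // guard against rounding up to bins
}

// Sequential reference histogram, useful for checking GPU results.
std::vector<unsigned int> histogram(const std::vector<float>& data,
                                    float lo, float hi, std::size_t bins) {
    std::vector<unsigned int> h(bins, 0u);
    for (float x : data) ++h[bin_index(x, lo, hi, bins)];
    return h;
}
```

With `lo = 0` and `hi = 1` this corresponds to the normalized input described above; passing, say, `lo = 0` and `hi = 256` would bin unnormalized 8-bit intensities directly.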

The code is based on the following two publications:
R. Shams and R. A. Kennedy, "Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices," in Proc. Int. Conf. on Signal Processing and Communications Systems (ICSPCS), Gold Coast, Australia, Dec. 2007, pp. 418-422.

R. Shams and N. Barnes, "Speeding up Mutual Information Computation Using NVIDIA CUDA Hardware," in Proc. Digital Image Computing: Techniques and Applications (DICTA), Adelaide, Australia, Dec. 2007, pp. 555-560. doi: 10.1109/DICTA.2007.4426846

I look forward to your feedback and comments.

The problem with all the GPU histogram calculations I've seen so far is too much data dependency.
Calculating the histogram of a blacked-out image is 10 times slower than for the normally distributed pixels usually used in tests.
In this case your approximate histogram calculation, which is claimed to be data independent, is the most attractive.
Unfortunately it is very inaccurate. If that is because of bugs, please fix them; otherwise histogram64 can be used for 8-bit images instead.
All your accurate histogram implementations, once converted to work with 8-bit images, lose to NVIDIA's histogram256 performance-wise.
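
The data dependency described in this reply can be illustrated with a deliberately simplified cost model, sketched here in C++. It assumes (and this is only an assumption, not a measurement or a description of any posted implementation) that updates to the same bin from within one group of 32 consecutive pixels, the CUDA warp size, are serialized. The function name `serialized_steps` is hypothetical, and the model is not meant to reproduce the 10x figure quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cost model: for each group of 32 consecutive pixels, the
// number of serialized steps is the largest number of pixels in that group
// that fall into the same one of 256 bins.
std::size_t serialized_steps(const std::vector<std::uint8_t>& pixels) {
    const std::size_t group = 32;  // CUDA warp size
    std::size_t steps = 0;
    for (std::size_t start = 0; start < pixels.size(); start += group) {
        const std::size_t end = std::min(start + group, pixels.size());
        std::vector<std::size_t> count(256, 0);
        std::size_t worst = 0;
        for (std::size_t i = start; i < end; ++i) {
            worst = std::max(worst, ++count[pixels[i]]);
        }
        steps += worst;  // the most contended bin dominates the group
    }
    return steps;
}
```

Under this model an image in which every pixel has the same value costs one step per pixel, while an image whose pixels spread evenly over the bins costs roughly one step per group, which is why uniformly colored inputs are a useful worst case to include in benchmarks.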

Thanks for posting this, interesting work.

I encourage other researchers to post their CUDA-related publications here on the forums.