Hi everybody, as part of a larger project we needed to implement a generalized histogram code for CUDA and I thought that other people could find use for it as well, so I spent some time optimizing it and making it as flexible as possible.
The project is available in github: Generalized Cuda-Histogram under the Apache v.2 license. The repository contains the histogram code (cuda_histogram.h) and samples (test*).
The histogram code is distributed as a single header-file (as independent as possible - to be used with CUDA runtime) and abstraction is handled through function-objects: The user defines a function-object, which is called once for each input-index and returns the bin-index (and the object to be added to the bin - this is just the integer ‘1’ for normal histograms) and the histogram code takes care of the rest. This way you can histogram arbitrary types of objects with an arbitrary binary operator (associativity and commutativity are required though). The code also supports arbitrary number of bins (tested up to 16384).
The performance is basically as good as, or better as the SDK-sample, expect that it is completely independent of input key-distributions (the SDK-code is slow for degenerate key-distributions) up to moderate histogram-sizes. It’s also about 20% faster than High Performance Predictable Histogramming on GPUs and when I tried it against NPP (1-channel 8-bit) histogram function, it was about 30-50% faster (for image sizes 1024x1024-8192x8192). For large histograms we alleviate the degenerate key-distribution problem by reducing the results within a warp using warp vote-instructions (when available). Also it should be noted that we have concentrated on optimization on Fermi-level HW.
If you need to add things to different bins in your CUDA-code, be sure to check the code if it works for you, and if not, tell us why. The code itself is still quite fresh and not yet well documented or cleaned up, so if you need to look into the implementation beyond the interfaces then it might be somewhat difficult to follow - I’ll start cleaning up the code next. We have ran quite a lot of tests, but as the code is new, there are bound to be issues to be found and fixed - please report any bugs you find and suggestions how to improve the code. Also feature requests are welcome and we could use some help in maintaining, supporting and testing the code - if you’re interested in contributing, then please, drop me a message.
Here is a list of features:
- Histograms of arbitrary sizes and bin-types
- Output to GPU or CPU-memory
- Runs on user-given CUDA-stream
- Multiple results per input index (one-dimensional index-range for now)
- Arbitrary binary reduction operator (has to be associative and commutative)
- User-given or automatically allocated temporary buffer (reduce latency)
- Fast-paths for normal histograms and scalar histograms (floats etc)
- Optimized for Fermi, should run on all CUDA-capable HW.
I’ll be more than happy to answer any questions you may have on the project.