Code: General purpose histogram

Hi everybody, as part of a larger project we needed to implement a generalized histogram code for CUDA and I thought that other people could find use for it as well, so I spent some time optimizing it and making it as flexible as possible.

The project is available in github: Generalized Cuda-Histogram under the Apache v.2 license. The repository contains the histogram code (cuda_histogram.h) and samples (test*).

The histogram code is distributed as a single header-file (as independent as possible - to be used with CUDA runtime) and abstraction is handled through function-objects: The user defines a function-object, which is called once for each input-index and returns the bin-index (and the object to be added to the bin - this is just the integer ‘1’ for normal histograms) and the histogram code takes care of the rest. This way you can histogram arbitrary types of objects with an arbitrary binary operator (associativity and commutativity are required though). The code also supports arbitrary number of bins (tested up to 16384).

The performance is basically as good as, or better as the SDK-sample, expect that it is completely independent of input key-distributions (the SDK-code is slow for degenerate key-distributions) up to moderate histogram-sizes. It’s also about 20% faster than High Performance Predictable Histogramming on GPUs and when I tried it against NPP (1-channel 8-bit) histogram function, it was about 30-50% faster (for image sizes 1024x1024-8192x8192). For large histograms we alleviate the degenerate key-distribution problem by reducing the results within a warp using warp vote-instructions (when available). Also it should be noted that we have concentrated on optimization on Fermi-level HW.

If you need to add things to different bins in your CUDA-code, be sure to check the code if it works for you, and if not, tell us why. The code itself is still quite fresh and not yet well documented or cleaned up, so if you need to look into the implementation beyond the interfaces then it might be somewhat difficult to follow - I’ll start cleaning up the code next. We have ran quite a lot of tests, but as the code is new, there are bound to be issues to be found and fixed - please report any bugs you find and suggestions how to improve the code. Also feature requests are welcome and we could use some help in maintaining, supporting and testing the code - if you’re interested in contributing, then please, drop me a message.

Here is a list of features:

  • Histograms of arbitrary sizes and bin-types
  • Output to GPU or CPU-memory
  • Runs on user-given CUDA-stream
  • Multiple results per input index (one-dimensional index-range for now)
  • Arbitrary binary reduction operator (has to be associative and commutative)
  • User-given or automatically allocated temporary buffer (reduce latency)
  • Fast-paths for normal histograms and scalar histograms (floats etc)
  • Optimized for Fermi, should run on all CUDA-capable HW.

I’ll be more than happy to answer any questions you may have on the project.

Hi again,

I just recently wrote a benchmark app, that measures the performance of this algorithm at various histogram-sizes (normal histograms only for now), so check it out from the link above (I’m also attaching the benchmark here for convenience, but for latest code, refer to the github site ).

I have done also a couple of optimizations to the code since last time and now small histograms run about 10-30% faster (30% at 256 bins) and large histograms should run something like 20-100% faster depending on case (although there is still room here for improvement).

I got on Tesla M2070 the following performance for uniform random data (algorithm is independent of input distribution up-to 320 bins):

bins    Gigakeys/s

64       42.0

128      33.5

256      17.4

512      7.2

1024     4.78

2048     1.95

4096     0.66

If you could take the moment and run this benchmark on your GPU and send me the results or post here, I would really appreciate it. I would be most interested in seeing the performance of this on GeForce 580. On pre-Fermi cards this will run slower as we have mostly concentrated on optimizations on new hardware, but results with any cards are appreciated.



Btw. the test has been written on Linux, so it might not compile out of the box on windows, but the only thing that should be needed for the port, would be to write the timer-code on windows (around line 500 on Compile it directly with nvcc something like this: “nvcc -O4 -arch=sm_20” and run by “./a.out --rnd” - random distribution, “./a.out --load” read data from “texture.raw” (need to download from github) or “./a.out” to run with almost degenerate distribution of keys. It will take a couple of minutes to run and uses quite a lot of GPU-memory (so in case you see “assert(INPUT) failed” the test ran out of memory).
cuda_histogram.h (110 KB) (17 KB)