GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

Originally published at:

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical representation of the data distribution across predefined bins. The input data set and the number of bins can vary greatly depending on the domain, so let’s focus on one of the most common use…

Thanks for the interesting article!

Are there any optimisation ideas on computing multidimensional histograms?
With the size about 32^3 - 32^4 collisions doesn't matter much, but random access to the large hist array does. So I just store one histogram in global memory and fill it with atomics - so simple but found nothing better.

Hello, Serge,

Yes, multi-dimensional histograms can be represented as linear
histograms with the large amount of bins. In case of 32^3 = 32K bins,
you won’t be able to keep the full copy of the histogram in the shared
memory. If you have a lot of contention to a specific address you can
try a multi-pass approach similar to a radix sort. If the data
distribution is fairly random, you’re right, just plain simple global
atomics to a single histogram array might be the best choice here. In
this case the threads won’t be competing with each other for atomic
increments so you will be mostly limited by write bandwidth to global


Ok, thank you!

Just wanted to say thank you for this post! It helped tremendously in a research project I was working on which involved measuring the galaxy bispectrum, which is effectively a large histogram. I even cited this post as a reference. Here's the paper on arXiv:

Hello - there are a few major errors in your shared atomics kernels. They are as follows:

1. In the first kernel, out += g * NUM_PARTS; should be out += g * NUM_BINS;

2. In the second kernel, for (int j = 0; j < n; j++) should be written as for (int j = 0; j < NUM_PARTS; j++)

One should be as clear as possible in their presentation when sharing information.

3. In the second kernel, total += in[i + NUM_PARTS * j]; should be total += in[i + NUM_BINS * j];

The code as provided cannot achieve the desired objective in all cases as it results in access violations. It's surprising to me that Nvidia engineers wouldn't have tested code with known input values and CUDA-MEMCHECK prior to posting.

Hi Rajeev,

1. The original version with out += g * NUM_PARTS is correct since each threadblock will be writing a separate histogram per channel, so up to NUM_BINS * <number of="" channels=""> which is basically NUM_PARTS in the example code above. Sorry that it was not clear in the post.
I think the subsequent points #2 and #3 in your comment would be resolved by the clarification I provided on #1, but just in case to reiterate:
2. In the second kernel, thread i is summing up values from i-th bins across all per-block histograms, so the loop will be from 0 to n where n is the number of threadblocks from the first kernel.
3. Each block from the first kernel will be using NUM_PARTS as the offset so we need to use the same offset in the 2nd kernel when merging the per-block histograms.


Hi Nikolay,
In the first kernel using shared memory, why do you initialize smem[3 * NUM_BINS + 3]? Suppose that NUM_BIN is 256, so the size of smem is 771. But in the loop for writing partial histogram into the global memory, the i value varies from 0 to 255, which means smem at index 256, 513 and 770 is never accessed. Is there any purpose for initializing smem with one extra index for each channel?

Hi Cao, the smem size is initialized to 3 * NUM_BINS + 3 because I access channels with an offset, for example, atomicAdd(&smem[NUM_BINS * 2 + b + 2], 1) - the 3rd (blue) channel is accessed with offset 2, so we have to allocate 3 * NUM_BINS + 3 to avoid out of memory errors. My original intention was to skew/shift the channels position in shared memory to avoid bank conflicts, however, I don't think that this matters between two separate atomic instructions, so I think you can reduce the size to just 3 * NUM_BINS and then accordingly update the smem indices in atomicAdds and combining into the global histogram. Thanks for the feedback!

Hi Nikolay,

It is still unclear to me what NUM_PARTS is. In your comment, you say that NUM_BINS * num_channels = NUM_PARTS, however, in your code (following the link to the CUB repo), NUM_PARTS seems to be arbitrarily set to 1024. Can you explain what NUM_PARTS is?


We can further improve performance on reducing the global bins in the second kernel call by using some popular optimized parallel reduction techniques, but to make a difference, the bin size should be quite large.