GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

jwitsoe · February 10, 2015, 10:58pm

Originally published at: GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell | NVIDIA Technical Blog

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical representation of the data distribution across predefined bins. The input data set and the number of bins can vary greatly depending on the domain, so let’s focus on one of the most common use…

anon41874091 · March 20, 2015, 8:22am

Thanks for the interesting article!

Are there any optimisation ideas on computing multidimensional histograms?
With the size about 32^3 - 32^4 collisions doesn't matter much, but random access to the large hist array does. So I just store one histogram in global memory and fill it with atomics - so simple but found nothing better.

anon15011306 · March 25, 2015, 11:58pm

Hello, Serge,

Yes, multi-dimensional histograms can be represented as linear
histograms with the large amount of bins. In case of 32^3 = 32K bins,
you won’t be able to keep the full copy of the histogram in the shared
memory. If you have a lot of contention to a specific address you can
try a multi-pass approach similar to a radix sort. If the data
distribution is fairly random, you’re right, just plain simple global
atomics to a single histogram array might be the best choice here. In
this case the threads won’t be competing with each other for atomic
increments so you will be mostly limited by write bandwidth to global
memory.

Thanks,
Nikolay.

anon41874091 · March 26, 2015, 2:57pm

Ok, thank you!

anon6764049 · December 20, 2017, 2:42pm

Just wanted to say thank you for this post! It helped tremendously in a research project I was working on which involved measuring the galaxy bispectrum, which is effectively a large histogram. I even cited this post as a reference. Here's the paper on arXiv: https://arxiv.org/abs/1712....

anon30108381 · April 24, 2018, 6:46pm

Hello - there are a few major errors in your shared atomics kernels. They are as follows:

1. In the first kernel, out += g * NUM_PARTS; should be out += g * NUM_BINS;

2. In the second kernel, for (int j = 0; j < n; j++) should be written as for (int j = 0; j < NUM_PARTS; j++)

One should be as clear as possible in their presentation when sharing information.

3. In the second kernel, total += in[i + NUM_PARTS * j]; should be total += in[i + NUM_BINS * j];

The code as provided cannot achieve the desired objective in all cases as it results in access violations. It's surprising to me that Nvidia engineers wouldn't have tested code with known input values and CUDA-MEMCHECK prior to posting.

anon15011306 · May 7, 2018, 7:27pm

Hi Rajeev,

1. The original version with out += g * NUM_PARTS is correct since each threadblock will be writing a separate histogram per channel, so up to NUM_BINS * <number of="" channels=""> which is basically NUM_PARTS in the example code above. Sorry that it was not clear in the post.
I think the subsequent points #2 and #3 in your comment would be resolved by the clarification I provided on #1, but just in case to reiterate:
2. In the second kernel, thread i is summing up values from i-th bins across all per-block histograms, so the loop will be from 0 to n where n is the number of threadblocks from the first kernel.
3. Each block from the first kernel will be using NUM_PARTS as the offset so we need to use the same offset in the 2nd kernel when merging the per-block histograms.

Nikolay.

anon24353161 · November 9, 2018, 6:21am

Hi Nikolay,
In the first kernel using shared memory, why do you initialize smem[3 * NUM_BINS + 3]? Suppose that NUM_BIN is 256, so the size of smem is 771. But in the loop for writing partial histogram into the global memory, the i value varies from 0 to 255, which means smem at index 256, 513 and 770 is never accessed. Is there any purpose for initializing smem with one extra index for each channel?

anon15011306 · November 20, 2018, 6:52pm

Hi Cao, the smem size is initialized to 3 * NUM_BINS + 3 because I access channels with an offset, for example, atomicAdd(&smem[NUM_BINS * 2 + b + 2], 1) - the 3rd (blue) channel is accessed with offset 2, so we have to allocate 3 * NUM_BINS + 3 to avoid out of memory errors. My original intention was to skew/shift the channels position in shared memory to avoid bank conflicts, however, I don't think that this matters between two separate atomic instructions, so I think you can reduce the size to just 3 * NUM_BINS and then accordingly update the smem indices in atomicAdds and combining into the global histogram. Thanks for the feedback!

anon35731234 · March 28, 2019, 5:57pm

Hi Nikolay,

It is still unclear to me what NUM_PARTS is. In your comment, you say that NUM_BINS * num_channels = NUM_PARTS, however, in your code (following the link to the CUB repo), NUM_PARTS seems to be arbitrarily set to 1024. Can you explain what NUM_PARTS is?

Thanks,
Zan

anon83263795 · February 26, 2020, 3:32am

We can further improve performance on reducing the global bins in the second kernel call by using some popular optimized parallel reduction techniques, but to make a difference, the bin size should be quite large.

Topic		Replies	Views
Worse atomic performance in shared than global memory CUDA Programming and Performance	7	9027	August 3, 2017
compute histogram with large bins CUDA Programming and Performance	11	20415	July 29, 2011
Histogram implementation CUDA Programming and Performance	12	2368	November 27, 2014
Fast 256-bin histogram CUDA Programming and Performance	6	2363	May 9, 2016
SDK histogram256 on GTX280 SDK histogram256 performance on nVIDIA GeForce GTX280 CUDA Programming and Performance	0	1936	December 8, 2008
Histogram without Atomics CUDA Programming and Performance	0	2190	October 9, 2011
Histogram computation CUDA Programming and Performance	7	6371	February 8, 2008
shared atomics with compute capability 1.1 CUDA Programming and Performance	5	4806	July 10, 2008
the memory using about histogram CUDA Programming and Performance	0	1772	November 1, 2007
Performance of Histogram256 demo Atomic writes are slow when conflict CUDA Programming and Performance	5	5781	August 1, 2008

GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

Related topics