compute histogram with large bins

Hi, I am now working on a program which require me to compute a histogram which has 256256 bins. I have read the relevant article in cuda-by-example and I complete the first cuda example of large bins(without allocate share memory) whose performance is bad.
The problem is how I can allocate the share memory just like the second example on 256
256 bins. Here is the part in the book:

global void histo_kernel( unsigned char *buffer,
long size,
unsigned int *histo ) {

__shared__  unsigned int temp[256];
temp[threadIdx.x] = 0;

int i = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
while (i < size) {
    atomicAdd( &temp[buffer[i]], 1 );
    i += stride;

atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );


Any one has idea? Thanks

You can’t, because 256256sizeof(int)=256KByte, while there are at most 48KByte per SM. What you can do is multiple passes where each pass covers only a subset of the bins. If you encode the pass in blockIdx.x, you might even have a good chance that the different passes for a block run at the same time and the data is cached.

Thanks, then how can I synchronize with the global memory instead of using “atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );” ? Can you show some details?

Thinking about it, it doesn’t seem worthwhile to use shared memory in this case, unless with more elaborate schemes. Just operate directly on global memory:

__global__ void histo_kernel( unsigned short *buffer, long size, unsigned int *histo ) {

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    int stride = blockDim.x * gridDim.x;

    while (i < size) {

        atomicAdd( &histo[buffer[i]], 1 );

        i += stride;



It’s possible to reduce the number of atomic operations on global memory using shared memory bin counters of just a few bits. On compute capability 1.x however with 65536 bins there will only be a single bit per bin (and a few bins with no bit at all), halving the atomic ops on main memory at best.

The performance of that version is relatively low, even worse than the one on cpu… Is using share memory the only way to inhance the performance?

OK, then we’ll have to dig a bit deeper. What GPU are you using? What CPU are you using? Is the data on the CPU or on the GPU originally?
Anything you know about the data (distribution, correlations)?

It might be difficult to beat the CPU in this problem if nothing more is known about the data.

Actually, the gpu I 'm now using is not a good one, I am going to move on to GTX 580 later. But now I just want to enhance the performance as much as I can.

GPU: GT555m 2 multiprocessor blocks, 96 stream processors, global memory: 962 MBytes, maximum 1024 threads per block, constant memory:65536 bytes, share memory:49152 bytes, memory bus width: 128-bit

CPU: i5 2410

The data are two int arrays, each one is 1010241024 long, the number is 0~255. They are initially made in CPU, then I use cudaMemcpy to copy them to GPU. In the CPU end the histogram is two-dimentional histo[256][256], so I use atomicAdd( &histo[array1[i]+256*array2[i], 1 ); to compute it in the GPU.

Any ideas?

Edited: My original idea relied on multiplying the two keys together - which doesn’t work at all

Instead - use the technique from the lexicographical sorting example of thrust to correctly sort the pairs. Using this technique will allow you to use the radix sort instead of defining a custom < operator for the pairs which would force the use of a slower O(N log N) sort.

Then you can use lower_bound with a custom compare function for the pairs to determine the start and end of each bin.

You mean giviing up the atomicAdd operation? But sorting may also be slow even the radix sort…

In terms of big-O notation, both the radix sort and the atomicAdd approach are O(N). I would try out the radix sort approach to give you a baseline speed and then see if you can improve upon it with an atomicAdd approach. The radix sort in thrust is very fast (especially since all your values are only 8 bits) and this binning algorithm should be straightforward to implement with thrust.

The atomicAdd approach almost surely has worse worst case performance, e.g. if all of the values fall into the same bin, there will be a huge amount of contention between the atomicAdds and performance will be quite terrible.

Thank you. Can you show me some details?

I will change the direction to fit the 48k limite on share memory. Theoretically it can allow 110110sizeof(int) which is 12100 bins. But when I try this on my GPU, it will always return wrong results. Is it all because using the share memory? I dont know why…