Histogram implementation

Hello, I wanted to ask:

If I have a histogram computed in C code like this:

const unsigned int myS = 256;

for ( unsigned int k = 0; k < AN; k++ ) {
    for ( unsigned int y = 0; y < myHeight; y++ ) {
        for ( unsigned int x = 0; x < myWidth; x++ ) {
            histogram[ (k * myS) + myImage[ (k * myHeight * myWidth) + (y * myWidth) + x ] ] += 1;
        }
    }
}

will the right implementation in CUDA be:

int y = threadIdx.y + blockIdx.y * blockDim.y;
int x = threadIdx.x + blockIdx.x * blockDim.x;
int Idx = x + y * myWidth;
if ( x < myWidth && y < myHeight ) {
	for ( unsigned int k = 0; k < AN; k++ ) {
		int theIdx = devImage[ Idx + (k * myWidth * myHeight) ];
		atomicAdd( &( devHistogram[ (k * myS) + theIdx ] ), 1 );
	}
}

Is this right?

Also, is there an easy way to avoid using atomicAdd?

Using shared memory?


I would suggest using a histogramming function from CUB or Thrust rather than writing your own.
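For reference, a minimal sketch of what the CUB route looks like with `cub::DeviceHistogram::HistogramEven` (untested here; the 256-bin setup and the `d_samples`/`d_histogram` pointer names are assumptions for an 8-bit image):

```cuda
#include <cub/cub.cuh>

// Sketch: 256-bin histogram over num_samples unsigned chars in d_samples.
// d_samples and d_histogram are assumed device pointers, already
// allocated and (for d_samples) filled by the caller.
int   num_levels  = 257;     // 257 bin boundaries -> 256 bins
float lower_level = 0.0f;
float upper_level = 256.0f;

// First call with d_temp_storage == NULL only computes temp_storage_bytes
void  *d_temp_storage     = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceHistogram::HistogramEven( d_temp_storage, temp_storage_bytes,
                                     d_samples, d_histogram,
                                     num_levels, lower_level, upper_level,
                                     num_samples );
cudaMalloc( &d_temp_storage, temp_storage_bytes );

// Second call actually builds the histogram
cub::DeviceHistogram::HistogramEven( d_temp_storage, temp_storage_bytes,
                                     d_samples, d_histogram,
                                     num_levels, lower_level, upper_level,
                                     num_samples );
cudaFree( d_temp_storage );
```

For multiple images (the `AN` loop above), you would call this once per image with the pointers offset accordingly.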

OK, thanks, but what if I don’t want to use libraries?

Is the above implementation right? The problem is with the indices.

Or, is there an easy way using shared memory?

I tried to run the code using a simple data array and I am not getting the same results as the serial code…

What is strange is that every time I run the executable, it increases some values by 1!

I also tried this approach:

if ( x < myWidth && y < myHeight ) {
    for ( unsigned int k = 0; k < AN; k++ ) {
        atomicAdd( &( devHistogram[ (k * myS) + devImage[ (k * myWidth * myHeight) + (y * myWidth) + x ] ] ), 1 );
    }
}

but still the same…

Any ideas?


Your global memory also needs to be set to zero (of course these are all obvious things).


  1. reduce in shared memory
  2. reduce to global memory

I am not using shared memory.
What should I do with global memory?
I have allocated memory and then freed it.

Memory allocation does not clear the memory.

OK, in every code I write, I allocate memory, use it, and then free it.
I don’t know what else I should do!

They ask you to initialize the devHistogram array with 0. After allocation the array contains “random” data (not actually random, but whatever values were in that memory before the allocation). Then your code adds values to these leftover numbers, and that might be the source of your problem.

If you run the code again and the memory allocation gives you the same space in memory, you will start from the values of the previous run.
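A minimal sketch of that initialization (assuming devHistogram holds AN * myS counters, with AN and myS as in the earlier posts):

```cuda
// Assumed sizes from the posts above: AN images, myS = 256 bins each
unsigned int *devHistogram;
cudaMalloc( (void**)&devHistogram, AN * myS * sizeof(unsigned int) );

// cudaMalloc does NOT clear the memory; zero it before accumulating
cudaMemset( devHistogram, 0, AN * myS * sizeof(unsigned int) );
```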

OK, I understand now, but the problem was that after each compilation and execution of the code the values kept increasing!
Whether I initialize the histogram or not doesn’t matter (I mean, I still get this error).

I found the error. It seems that it was in the “AN” variable, in its size.

Can someone help with the use of shared memory?


  1. Let each block produce a histogram in shared memory
  2. add said shared memory histogram to global memory histogram.
    NOTE: don’t forget to initialize global memory

Something along the lines of:

// I give no warranties WHATSOEVER that this untested code would work :-)
__global__ void histrogram_stub_code_kernel( uint8_t* d_in_ptr, int* d_histogram, int N )
{
	const int B_DIM_X = 256;
	// Shared memory histogram (blockDim.x must equal B_DIM_X)
	__shared__ int s_histogram[B_DIM_X];
	int tx = threadIdx.x + blockIdx.x * B_DIM_X;

	// Initialize the shared bins, one bin per thread
	s_histogram[threadIdx.x] = 0;
	__syncthreads();

	if( tx < N ) {
		// Read data into a register
		uint8_t reg_val = d_in_ptr[tx];
		// Atomic add to the shared memory buffer
		atomicAdd( &s_histogram[reg_val], 1 );
	}
	__syncthreads();

	// Every thread flushes its shared bin to the global histogram,
	// including threads with tx >= N in a partially filled last block,
	// since their bin may hold counts from other threads.
	atomicAdd( &d_histogram[threadIdx.x], s_histogram[threadIdx.x] );
}

WARRANTY: Completely untested piece of code that I just threw together :-)

OK, thank you for the idea.
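For completeness, a hedged sketch of host-side setup and launch for a shared-memory kernel like the stub above (untested; d_in_ptr, d_histogram, and N are assumed to be allocated and filled by the caller):

```cuda
// Hypothetical launch for the shared-memory histogram kernel above.
const int threads = 256;                         // must match B_DIM_X
const int blocks  = (N + threads - 1) / threads; // ceil(N / threads)

// The global histogram must start at zero (see the earlier posts)
cudaMemset( d_histogram, 0, 256 * sizeof(int) );

histrogram_stub_code_kernel<<<blocks, threads>>>( d_in_ptr, d_histogram, N );
cudaDeviceSynchronize();
```

The same pattern extends to the multi-image case by looping the launch over k (or folding k into the grid) and offsetting the image and histogram pointers by k * myWidth * myHeight and k * myS.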