How to create a histogram with CUDA?

Hi all, I am writing a CUDA kernel that computes a histogram of a picture, but I have no idea how to return an array from the kernel, and the array changes while other threads are reading it. Is there any possible solution?

__global__ void Hist(
	TColor *dst, //input image
	int imageW,
	int imageH,
	int *data    //histogram bins in device memory
){
	const int ix = blockDim.x * blockIdx.x + threadIdx.x;
	const int iy = blockDim.y * blockIdx.y + threadIdx.y;

	if(ix < imageW && iy < imageH){
		//this assigns the RED value of the image pixel to pixel
		int pixel = get_red(dst[imageW * (iy) + (ix)]);

		data[pixel] ++; // ?? problem statement: several threads may update the same bin at once
	}
}

@param d_dst: the input image (TColor is equal to float4)

@param data: the histogram array, size [255]

extern "C" void
cuda_Hist(TColor *d_dst, int imageW, int imageH, int* data)
{
  dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
  dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y));

  Hist<<<grid, threads>>>(d_dst, imageW, imageH, data);
}

Maybe use atomicAdd…
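On cards that do support global-memory atomics (compute capability 1.1 or higher), the atomicAdd idea might look roughly like the sketch below. It also shows one way to "return" the array: the kernel writes the bins into device memory and the host copies them back with cudaMemcpy. Several things here are assumptions and not from the original post: the red channel is taken to be already available as one byte per pixel (instead of TColor and get_red), there are 256 bins, and the names HistAtomic, cuda_HistAtomic and NUM_BINS are made up for illustration.

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256  // assumption: red values are 8-bit, 0..255

// Assumption: red holds one byte per pixel, row-major, in device memory.
__global__ void HistAtomic(const unsigned char *red, int imageW, int imageH,
                           unsigned int *data)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if (ix < imageW && iy < imageH) {
        const int pixel = red[imageW * iy + ix];
        // atomicAdd makes the read-modify-write of a bin indivisible,
        // so concurrent updates to the same bin are not lost.
        atomicAdd(&data[pixel], 1u);
    }
}

// Hypothetical host wrapper: uploads the image, runs the kernel, and copies
// the NUM_BINS counters back into h_data (host memory).
extern "C" cudaError_t
cuda_HistAtomic(const unsigned char *h_red, int imageW, int imageH,
                unsigned int *h_data)
{
    const size_t imageBytes = (size_t)imageW * imageH;
    const size_t histBytes  = NUM_BINS * sizeof(unsigned int);

    unsigned char *d_red  = 0;
    unsigned int  *d_data = 0;
    cudaMalloc((void **)&d_red, imageBytes);
    cudaMalloc((void **)&d_data, histBytes);
    cudaMemcpy(d_red, h_red, imageBytes, cudaMemcpyHostToDevice);
    cudaMemset(d_data, 0, histBytes);  // bins must start at zero

    dim3 threads(16, 16);
    dim3 grid((imageW + 15) / 16, (imageH + 15) / 16);
    HistAtomic<<<grid, threads>>>(d_red, imageW, imageH, d_data);

    // Check the launch, then copy the result back. This blocking cudaMemcpy
    // runs after the kernel has finished on the default stream.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_data, d_data, histBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_red);
    cudaFree(d_data);
    return err;
}
```

Calling cuda_HistAtomic with a host image and a 256-entry host array should leave the counts in that array. With older toolchains the file would need to be compiled for an architecture that has atomics (for example -arch sm_11); heavy contention on a few bins will still serialize the updates.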

The 8800 GTX (and GTS) do not support any atomic functions. I’ve seen hacks posted in the forum to work around this, but it would be safest to avoid them.

Your next option is to go with a reduction-style technique, but this is going to be complicated to make fast on compute capability 1.0 devices. It would work like a normal reduction, but one for each bin. It would be easier if you were computing the histogram for many pictures simultaneously.
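To make the atomics-free idea a bit more concrete, here is a deliberately simple, unoptimized sketch of one way it could be done. It is only one reading of the suggestion, not a tuned implementation. Each block builds a partial histogram of its own chunk of the image, with thread t owning bin t so that no two threads ever write the same counter; a second kernel then sums the partial histograms, one thread per bin. The names PartialHist, SumPartials, cuda_HistNoAtomics, NUM_BINS and NUM_BLOCKS are hypothetical, and as an assumption the input is one byte of red per pixel with 256 bins.

```cuda
#include <cuda_runtime.h>

#define NUM_BINS   256  // assumption: 8-bit values; also the block size
#define NUM_BLOCKS 64   // assumption: number of image chunks

// Stage 1: thread t of each block counts the pixels equal to t in the
// block's chunk. The chunk is staged through shared memory tile by tile.
__global__ void PartialHist(const unsigned char *red, int numPixels,
                            int pixelsPerBlock, unsigned int *partial)
{
    __shared__ unsigned char tile[NUM_BINS];
    const int bin   = threadIdx.x;  // launched with NUM_BINS threads
    const int begin = blockIdx.x * pixelsPerBlock;
    int end = begin + pixelsPerBlock;
    if (end > numPixels) end = numPixels;

    unsigned int count = 0;
    for (int base = begin; base < end; base += NUM_BINS) {
        const int idx = base + threadIdx.x;
        if (idx < end) tile[threadIdx.x] = red[idx];
        __syncthreads();  // tile is fully loaded

        int valid = end - base;
        if (valid > NUM_BINS) valid = NUM_BINS;
        for (int i = 0; i < valid; ++i)
            if (tile[i] == bin) ++count;
        __syncthreads();  // everyone is done before the tile is overwritten
    }
    partial[blockIdx.x * NUM_BINS + bin] = count;
}

// Stage 2: one thread per bin adds up that bin across all partial histograms.
__global__ void SumPartials(const unsigned int *partial, int numBlocks,
                            unsigned int *data)
{
    const int bin = threadIdx.x;
    unsigned int sum = 0;
    for (int b = 0; b < numBlocks; ++b) sum += partial[b * NUM_BINS + bin];
    data[bin] = sum;
}

// Hypothetical host wrapper: h_red is a host image with numPixels bytes,
// h_data receives NUM_BINS counters.
extern "C" cudaError_t
cuda_HistNoAtomics(const unsigned char *h_red, int numPixels,
                   unsigned int *h_data)
{
    const size_t histBytes = NUM_BINS * sizeof(unsigned int);
    const int pixelsPerBlock = (numPixels + NUM_BLOCKS - 1) / NUM_BLOCKS;

    unsigned char *d_red = 0;
    unsigned int *d_partial = 0, *d_data = 0;
    cudaMalloc((void **)&d_red, numPixels);
    cudaMalloc((void **)&d_partial, NUM_BLOCKS * histBytes);
    cudaMalloc((void **)&d_data, histBytes);
    cudaMemcpy(d_red, h_red, numPixels, cudaMemcpyHostToDevice);

    PartialHist<<<NUM_BLOCKS, NUM_BINS>>>(d_red, numPixels, pixelsPerBlock,
                                          d_partial);
    SumPartials<<<1, NUM_BINS>>>(d_partial, NUM_BLOCKS, d_data);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_data, d_data, histBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_red);
    cudaFree(d_partial);
    cudaFree(d_data);
    return err;
}
```

The trade-off is that every thread scans its block's whole chunk, so each block does 256 times more comparisons than an atomic version would do increments. It avoids write conflicts entirely, which is the point on hardware without atomics, but a tuned version would need a smarter scheme for the per-block stage.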