How to optimize the operation

Hi, I’d like to binarize an 8-bit mono image.

I also want to minimize the size of the output data that has to be transferred from device to host.

So I’d like the output to be 1 bit per pixel, packed into unsigned integers (which atomic functions can operate on).

Using atomic functions prevents lost updates when multiple threads modify the same 32-bit word.

My code is below, but it seems to be very slow (about 200 ms on a GTX285).

From what I have checked, the bottleneck is the atomic operation.

Is there a way to do this faster while still preventing multiple threads from operating on the same output word?

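__global__ void Binarize(unsigned int* Source, unsigned int* Dest, int SizeX, int SizeY, unsigned char th)
{
  // IMUL is assumed to be the SDK-style macro: #define IMUL(a, b) __mul24(a, b)
  // Note: Source is read as one 32-bit value per pixel; a packed 8-bit image
  // would need an unsigned char* here instead.
  unsigned int idxX = IMUL(blockIdx.x, blockDim.x) + threadIdx.x,
               idxY = IMUL(blockIdx.y, blockDim.y) + threadIdx.y,
               index_in = IMUL(idxY, SizeX) + idxX,
               nSet = index_in / 32,          // which 32-bit output word
               nBit = index_in % 32,          // which bit inside that word
               OneVal = 1,
               ZeroVal = ~(OneVal << nBit);   // all 32 bits set except nBit

  if (idxX < SizeX && idxY < SizeY) {
    if (*(Source + index_in) >= th) {
      atomicOr(Dest + nSet, OneVal << nBit);  // set this pixel's bit
    }
    else {
      atomicAnd(Dest + nSet, ZeroVal);        // clear this pixel's bit
    }
  }
}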

If each thread processes 32 input pixels, it owns one complete 32-bit output word, so you can eliminate the atomic operations altogether. Or have each thread process 8 input pixels and write the result as a single unsigned char. I would expect either of these approaches to be much faster than what you’re doing now, though I don’t know which would be faster. I’d say try both.
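A minimal sketch of the 32-pixels-per-thread idea, in case it helps. It assumes the image is stored as tightly packed unsigned char in row-major order and treats it as a flat array of SizeX * SizeY pixels, so the bit layout matches the original kernel (pixel i goes to bit i % 32 of word i / 32). The names BinarizePacked32 and BinarizeOnDevice are made up for this example, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread packs 32 consecutive pixels into one 32-bit output word.
// A thread is the only writer of its word, so no atomics are needed.
__global__ void BinarizePacked32(const unsigned char* src, unsigned int* dst,
                                 int numPixels, unsigned char th)
{
    int word = blockIdx.x * blockDim.x + threadIdx.x;  // output word index
    int base = word * 32;                              // first pixel of this word
    if (base >= numPixels) return;

    unsigned int bits = 0;
    for (int b = 0; b < 32; ++b) {
        int p = base + b;
        // Bit b is set when pixel p is at or above the threshold;
        // pixels past the end of the image leave their bit at 0.
        if (p < numPixels && src[p] >= th) bits |= (1u << b);
    }
    dst[word] = bits;  // one plain store per thread
}

// Hypothetical host wrapper: copies the image to the device, runs the
// kernel, and returns the packed bits (error checking omitted).
std::vector<unsigned int> BinarizeOnDevice(const std::vector<unsigned char>& img,
                                           unsigned char th)
{
    int numPixels = (int)img.size();
    int numWords = (numPixels + 31) / 32;
    std::vector<unsigned int> out(numWords, 0u);
    if (numPixels == 0) return out;

    unsigned char* dSrc = 0;
    unsigned int* dDst = 0;
    cudaMalloc((void**)&dSrc, numPixels);
    cudaMalloc((void**)&dDst, numWords * sizeof(unsigned int));
    cudaMemcpy(dSrc, img.data(), numPixels, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (numWords + threads - 1) / threads;
    BinarizePacked32<<<blocks, threads>>>(dSrc, dDst, numPixels, th);

    cudaMemcpy(out.data(), dDst, numWords * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    return out;
}
```

The 8-pixel variant would be the same loop with 8 iterations and an unsigned char output. In this sketch each thread reads 32 neighbouring bytes on its own, which is probably not the best memory access pattern; loading wider types or staging through shared memory might help, but that is something to measure rather than assume.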