Histogram256 thread-tagging: 4-bit or 5-bit threadTag?

Hey guys,

Just a quick question about the SDK 256-histogram example:

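__device__ void addData256(volatile unsigned int *s_WarpHist, unsigned int data, unsigned int threadTag)
{
  unsigned int count;

  do {
    // Strip the previous writer's tag, keeping the low 27 count bits
    count = s_WarpHist[data] & 0x07FFFFFFU;
    // Increment the count and stamp it with this thread's tag
    count = threadTag | (count + 1);
    s_WarpHist[data] = count;
    // Retry if another thread's write to the same bin won
  } while(s_WarpHist[data] != count);
}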

The threadTag used to achieve software atomic updates of shared memory is 5 bits wide, which leaves 27 bits for the count (hence the 0x07FFFFFF mask). This assumes 32 threads per warp, but if warps are actually executed as half-warps, does that mean we can get away with using a 4-bit threadTag?
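To make the bit layout concrete, here is a rough host-side sketch in plain C of how the tag and count share one 32-bit word. The helper names (count_mask, make_tag, tagged_increment) are made up for illustration and are not part of the SDK sample. It also assumes the tag is simply the thread's lane index shifted into the top bits. It only shows the arithmetic; it does not say whether a 4-bit tag is actually safe on the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not from the SDK sample. */

/* Mask keeping the low (32 - tag_bits) bits, i.e. the count field. */
static uint32_t count_mask(unsigned tag_bits)
{
    assert(tag_bits > 0 && tag_bits < 32);
    return 0xFFFFFFFFu >> tag_bits;
}

/* Assumed tag layout: lane index placed in the top tag_bits bits. */
static uint32_t make_tag(unsigned lane, unsigned tag_bits)
{
    assert(tag_bits > 0 && tag_bits < 32);
    assert(lane < (1u << tag_bits));
    return (uint32_t)lane << (32u - tag_bits);
}

/* One pass of the loop body: drop the old tag, add 1, apply our tag. */
static uint32_t tagged_increment(uint32_t word, uint32_t tag, unsigned tag_bits)
{
    return tag | ((word & count_mask(tag_bits)) + 1u);
}
```

With 5 tag bits, count_mask(5) gives the 0x07FFFFFF from the SDK code, and with 4 bits it would be 0x0FFFFFFF, so the count gains one bit. The catch is that with a 4-bit tag taken as lane & 15, lanes 1 and 17 get the same tag, so the scheme would only hold if those two lanes can never race on the same bin, which is exactly what I'm asking about.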

I’ve also noticed that when I upgraded from CUDA 1.1 to 2.0b, the histogram256 example was removed from the SDK. Is this just a mistake, or was it intentionally left out? Is there anything wrong with the “write combining” approach to shared memory?