Hey guys,
Just a quick question about the SDK 256-histogram example:
__device__ void addData256(volatile unsigned int *s_WarpHist, unsigned int data, unsigned int threadTag)
{
    unsigned int count;
    do {
        count = s_WarpHist[data] & 0x07FFFFFFU;
        count = threadTag | (count + 1);
        s_WarpHist[data] = count;
    } while (s_WarpHist[data] != count);
}
The thread tag used to achieve software atomic updates of shared memory is 5 bits wide (the top 5 bits of each counter, given the 0x07FFFFFF mask). That assumes 32 threads per warp, but if warps are actually executed as half-warps, does that mean we can get away with a 4-bit thread tag?
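To make the question concrete, here is a rough host-side sketch of what I think is at stake. It is plain C++ (it should also compile in a .cu file) and it is not the SDK code: the helper names makeThreadTag, countMask and simulateSameBin are made up for illustration, and the execution model is an assumption. It pretends that all participating lanes read the bin, then all write it with the highest-numbered pending lane winning, then all read it back. Real hardware only promises that one of the conflicting writes lands, so this shows the bookkeeping, not what the GPU actually does with half-warps.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: place the lane index in the top `tagBits` bits of a
// 32-bit counter. Lanes >= 2^tagBits lose their high bits and alias with
// lower lanes. Valid for tagBits in [1, 31].
inline uint32_t makeThreadTag(uint32_t lane, unsigned tagBits) {
    return lane << (32u - tagBits);
}

// Hypothetical helper: mask that keeps only the count bits below the tag.
// For tagBits == 5 this is 0x07FFFFFF, the constant used in addData256.
inline uint32_t countMask(unsigned tagBits) {
    return 0xFFFFFFFFu >> tagBits;
}

// Simulate `lanes` threads all incrementing the SAME bin with the
// tag-and-retry loop, under an ASSUMED lockstep model:
//   1. every pending lane reads the bin,
//   2. every pending lane writes tag | (count + 1); the last writer wins,
//   3. every pending lane reads back and retires if it sees its own value.
// Returns the final count stored in the bin.
uint32_t simulateSameBin(unsigned lanes, unsigned tagBits) {
    const uint32_t mask = countMask(tagBits);
    uint32_t bin = 0;
    std::vector<bool> done(lanes, false);
    unsigned remaining = lanes;

    while (remaining > 0) {
        const uint32_t next = (bin & mask) + 1;  // same for all pending lanes

        // Step 2: conflicting writes, highest pending lane written last.
        for (unsigned lane = 0; lane < lanes; ++lane) {
            if (!done[lane]) {
                bin = makeThreadTag(lane, tagBits) | next;
            }
        }

        // Step 3: read back; lanes whose tagged value survived are finished.
        for (unsigned lane = 0; lane < lanes; ++lane) {
            if (!done[lane] && bin == (makeThreadTag(lane, tagBits) | next)) {
                done[lane] = true;
                --remaining;
            }
        }
    }
    return bin & mask;
}
```

Under this model, simulateSameBin(32, 5) returns 32 and simulateSameBin(16, 4) returns 16, so both count every increment. simulateSameBin(32, 4) returns only 16, because lanes i and i + 16 end up with the same tag and both believe their write succeeded. So a 4-bit tag looks safe only if no more than 16 threads can ever have conflicting writes in flight to the same bin at once, which is exactly the part I am unsure about.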
I’ve also noticed that when I upgraded from CUDA 1.1 to CUDA 2.0b, the histogram256 example was removed from the SDK. Is this just a mistake, or was it intentionally left out? Is there anything wrong with the “write combining” approach to shared memory?
Thanks,
-Shihab