When I read the histogram64 sample, several things puzzled me. In the kernel, threadPos is computed with three bit operations:
//Encode thread index in order to avoid bank conflicts in s_Hist[] access:
//each half-warp accesses consecutive shared memory banks
//and the same bytes within the banks
const int threadPos =
    //[31 : 6] <== [31 : 6]
    ((threadIdx.x & (~63)) >> 0) |
    //[5 : 2] <== [3 : 0]
    ((threadIdx.x & 15) << 2) |
    //[1 : 0] <== [5 : 4]
    ((threadIdx.x & 48) >> 4);
Why is it done this way? Is there a reason for this particular bit shuffle?
My other question is this:
In the main program, the h_Data array is generated with values in the range [0, 255]. Why, then, does the kernel comment say that only a 64-bin histogram of the 8-bit input data is calculated, and that only the highest 6 bits of each 8-bit data element are extracted?
//Cycle through current block, update per-thread histograms
//Since only 64-bin histogram of 8-bit input data array is calculated,
//only highest 6 bits of each 8-bit data element are extracted,
//leaving out 2 lower bits.
for(int pos = threadIdx.x; pos < dataSize; pos += blockDim.x){
    unsigned int data4 = d_Data[baseIndex + pos];
    addPixel64(s_Hist, threadPos, (data4 >> 2) & 0x3FU);
    addPixel64(s_Hist, threadPos, (data4 >> 10) & 0x3FU);
    addPixel64(s_Hist, threadPos, (data4 >> 18) & 0x3FU);
    addPixel64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
}
I am confused by this code and hope someone can give me a detailed explanation. I have read histogram.pdf, but I can't find the answer there.
Any reply would be appreciated.