Hi, I tried the “tag and test” idea suggested by Mark’s colleague (see the longest thread on this board), but I can’t get it to work correctly.
Besides, even if it were correct, it would be slow, because my global memory copy is also shared among threads and also sits inside the critical section:
(My s_cnt plays the same role as histo[bin] in the previous post.)
    do {
        d_GM[..(s_cnt & 0x7FFFFFF)] = d_GM[xx];
        val = (s_cnt & 0x7FFFFFF);
        val = ((tx & 0x1F) << 27) | (val + 1);
        s_cnt = val;
    } while (s_cnt != val);
Finally, we strip the tag and write out s_cnt = s_cnt & 0x7FFFFFF.
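To make the word layout explicit, here is a small host-side sketch of how the tag and the counter share one 32-bit value. The helper names (make_tag, tagged_increment, strip_tag) are hypothetical and only for illustration; the mask and the shift are the ones from the loop above. It assumes the counter stays below 2^27 - 1, so the increment never spills into the tag bits.

```cpp
#include <cstdint>

// Hypothetical helpers; the mask and shift mirror the loop in the post.
constexpr std::uint32_t COUNT_MASK = 0x07FFFFFFu; // low 27 bits hold the counter
constexpr unsigned TAG_SHIFT = 27;                // high 5 bits hold the tag

// Tag derived from the thread index, as in (tx & 0x1F).
constexpr std::uint32_t make_tag(std::uint32_t tx) { return tx & 0x1Fu; }

// The word a thread tries to store: its tag plus the incremented counter.
constexpr std::uint32_t tagged_increment(std::uint32_t word, std::uint32_t tx) {
    return (make_tag(tx) << TAG_SHIFT) | ((word & COUNT_MASK) + 1u);
}

// Drop the tag to recover the plain counter (the final write-out step).
constexpr std::uint32_t strip_tag(std::uint32_t word) { return word & COUNT_MASK; }
```

With this layout, tx = 3 and tx = 35 produce exactly the same tagged word, so as far as I can tell the tag can only tell apart threads within one 32-thread warp.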
But the resulting s_cnt is incorrect, about the same as with no “tag and test” at all. Any clues? Thanks!