How to carry out the suggested "tag and test" idea for thread synchronization

Hi, I tried to implement the “tag and test” idea suggested by Mark’s colleague (see the biggest post on this board), but I can’t get it to work correctly.

Besides, even if it were correct, it would be slow, since my global memory copy is also shared among threads and also sits inside the critical section:

(My s_cnt plays the role of histo[bin] in the previous post.)

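    do {
        d_GM[..(s_cnt & 0x7FFFFFF)] = d_GM[xx];   // global memory copy, indexed by the current count
        val = (s_cnt & 0x7FFFFFF);                // strip the 5-bit tag, keep the 27-bit count
        val = ((tx & 0x1F) << 27) | (val + 1);    // tag the incremented count with the lane id
        s_cnt = val;                              // try to publish it
    } while (s_cnt != val);                       // retry if another thread's write won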

At the end, we write out s_cnt = s_cnt & 0x7FFFFFF to strip the tag.

But the resulting s_cnt is incorrect, about the same as with no “tag and test” at all. Any clues? Thanks!

That technique only works with shared memory, not global. And even then, each word may only be updated by the threads of a single warp, not by the whole block.

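The thing is, warps may be swapped in and out at any moment, so a second warp can slip its own read and write in between another warp's read and its test, and an increment gets lost.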
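
To make the "one word per warp" restriction concrete, here is a minimal sketch of a per-warp tag-and-test counter. It is not the code from the earlier post. The names `tagAndTestIncrement`, `countPerWarp` and `runTagAndTest` are made up for illustration, and it assumes CUDA 9 or later so that `__syncwarp()` and `__any_sync()` can enforce the warp-level ordering that the original trick took for granted. It also assumes a 1D block whose size is a multiple of 32, and counts that stay below 2^27. On hardware with shared-memory atomics, a plain `atomicAdd` on the counter is the simpler choice.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define WARP_SIZE  32
#define COUNT_MASK 0x07FFFFFFu   // low 27 bits hold the count
#define FULL_MASK  0xFFFFFFFFu   // all 32 lanes of the warp

// Tag-and-test increment of a shared-memory counter owned by ONE warp.
// Assumption: all 32 threads of the warp call this together.
__device__ void tagAndTestIncrement(volatile unsigned int *counter)
{
    const unsigned int tag = (threadIdx.x & 0x1F) << 27;  // lane id in the top 5 bits
    bool done = false;
    // Loop until every lane has had its increment accepted.
    while (__any_sync(FULL_MASK, !done)) {
        const unsigned int val = tag | ((*counter & COUNT_MASK) + 1);
        __syncwarp();                      // all lanes have read before any lane writes
        if (!done) *counter = val;         // colliding writes: exactly one survives
        __syncwarp();                      // writes are visible before the read-back
        if (!done && *counter == val) done = true;  // this lane's tag survived
    }
}

// Each warp increments its own counter; lane 0 writes out the untagged count.
__global__ void countPerWarp(unsigned int *d_out, int itersPerThread)
{
    __shared__ volatile unsigned int s_cnt[32];  // one word per warp (up to 32 warps per block)
    const int warp = threadIdx.x / WARP_SIZE;
    const int lane = threadIdx.x & 0x1F;
    if (lane == 0) s_cnt[warp] = 0;
    __syncwarp();
    for (int i = 0; i < itersPerThread; ++i)
        tagAndTestIncrement(&s_cnt[warp]);
    __syncwarp();
    if (lane == 0)
        d_out[blockIdx.x * (blockDim.x / WARP_SIZE) + warp] = s_cnt[warp] & COUNT_MASK;
}

// Host helper: returns the number of warps with a wrong count, or -1 on a CUDA error.
int runTagAndTest(int blocks, int threadsPerBlock, int iters)
{
    const int warps = blocks * (threadsPerBlock / WARP_SIZE);
    std::vector<unsigned int> h_out(warps);
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_out, warps * sizeof(unsigned int)) != cudaSuccess) return -1;
    countPerWarp<<<blocks, threadsPerBlock>>>(d_out, iters);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, warps * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;
    int bad = 0;
    for (int w = 0; w < warps; ++w)
        if (h_out[w] != (unsigned int)(WARP_SIZE * iters)) ++bad;
    return bad;
}
```

Calling `runTagAndTest(4, 128, 100)` from `main` is expected to return 0 when every warp's counter reaches 32 * 100. If two warps shared one counter, nothing would order one warp's read against the other's write, and that is the failure described above.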