I am trying to use the warp voting function to check whether value calculated by all threads of BLOCK are zero or not. Actually in my kernel the number of iterations within a block depend whether all the results (ie result by each thread) of the previous iteration are zero or not.
I see that __any(previous result) will check if any of the previous results of a particular warp was non-zero. What i decided was that I will add the result of __any () from each warp of the block atomically into a shared memory location and then check that location for zero. If its equal to zero then all previous results were zero and i will break the loop.
My problem is that I am not able to understand – whether this __any function is called by a particular thread in each block or do we have to call the __any function from 1 thread/block and then combine the results?
Kindly help also let me know if there is any better solution to this other than combine the voting function results.
Since you’re doing a block-wide check, the voting primatives won’t be enough and you’ll have to use shared memory to intercommunicate. And if you’re going to do that, you may as well just skip the vote functions and do it all with shared memory.
To do a block-wide ANY, use code similar to this:
__shared__ int anyresult;
/* whatever per-thread code here... */
if (mythreadvote) anyresult=1;
/* anyresult holds the block-wide any vote */
You can do an ALL compute in the same way.
This method is also useful even for per-warp tests on older hardware without the vote functions.
Thanks for your reply SPWorley. I have already tried a way very similar to what you have told. But I was hoping that voting function if used properly may do the job much faster than this as its probably a hardware improvement.
Can I have a very naive question here? My understanding of Cuda is minimal at best, and I need something like this. My guess that the “mythreadvote” stands for the per thread condition like “array[tid]==0”, but writing to the same shared memory variable is not going to serialize the whole operation? because I have to check quite huge arrays for a simple condition, and i not even interested in how many conditions came up false, if one does than I can move on.
Writing into same bank in shared memory leads to serialisation. There is no “broadcast” for writing, as it works only in reading. Therefore using warp vote functions, if possible, may be not that bad idea. Here is how I would transform SPWorley’s code:
if (threadIdx.x==0) blockResult=false;
if ((threadIdx.x&31)==0 && warpResult) blockResult=true; //execute it for first thread of every warp only
Note the parenthesis around (threadIdx.x&31). The & operator has lower priority than the comparison ==.