WARP Voting function

sidxavier · April 26, 2009, 11:58pm

Hi,

I am trying to use the warp voting functions to check whether the values calculated by all threads of a block are zero or not. In my kernel, the number of iterations within a block depends on whether all the results (i.e. the result of each thread) of the previous iteration are zero.

I see that __any(previous result) will check if any of the previous results of a particular warp was non-zero. What I decided was to add the result of __any() from each warp of the block atomically into a shared memory location and then check that location for zero. If it's equal to zero, then all previous results were zero and I will break out of the loop.

My problem is that I am not able to understand whether this __any function is called by a particular thread in each block, or whether we have to call __any from one thread per block and then combine the results.

Kindly help. Also let me know if there is any better solution than combining the voting function results.

Thanks
Sid.

SPWorley · April 27, 2009, 11:47am

Since you’re doing a block-wide check, the voting primitives won’t be enough and you’ll have to use shared memory to communicate between warps. And if you’re going to do that, you may as well just skip the vote functions and do it all with shared memory.

To do a block-wide ANY, use code similar to this:

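__shared__ int anyresult;

/* whatever per-thread code here... */

anyresult = 0;                    /* every thread stores the same 0, so the overlap is harmless */
__syncthreads();                  /* the reset must land before any vote is written */
if (mythreadvote) anyresult = 1;  /* any thread with a true vote sets the flag */
__syncthreads();                  /* wait until all votes have been written */

/* anyresult holds the block-wide any vote */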

You can do an ALL compute in the same way.

This method is also useful even for per-warp tests on older hardware without the vote functions.
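For the ALL case mentioned above, one possible sketch (not SPWorley's own code) inverts the flag: it starts at 1, and any thread whose vote is false clears it. The kernel name `blockAllZero`, the host helper `runBlockAllZero`, the arrays `data` and `blockAll`, and the "element is zero" vote are all illustrative assumptions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block reports whether all of its elements are zero.
__global__ void blockAllZero(const int *data, int *blockAll, int n)
{
    __shared__ int allresult;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads vote "true" so they cannot veto the result.
    int mythreadvote = (i >= n) || (data[i] == 0);

    if (threadIdx.x == 0) allresult = 1;   // assume ALL holds until disproved
    __syncthreads();                       // make the initial value visible

    if (!mythreadvote) allresult = 0;      // any dissenting thread clears the flag
    __syncthreads();                       // all votes are in

    if (threadIdx.x == 0) blockAll[blockIdx.x] = allresult;
}

// Hypothetical host helper: returns one 0/1 flag per block.
std::vector<int> runBlockAllZero(const std::vector<int> &h, int threads)
{
    int n = (int)h.size();
    int blocks = (n + threads - 1) / threads;
    int *dData, *dAll;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dAll, blocks * sizeof(int));
    cudaMemcpy(dData, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockAllZero<<<blocks, threads>>>(dData, dAll, n);

    std::vector<int> out(blocks);
    cudaMemcpy(out.data(), dAll, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dAll);
    return out;
}

int main()
{
    std::vector<int> h(512, 0);
    h[300] = 7;   // single non-zero element, located in block 1
    std::vector<int> r = runBlockAllZero(h, 256);
    printf("block0 all-zero=%d block1 all-zero=%d\n", r[0], r[1]);
    return 0;
}
```

If the vote is repeated inside a loop, a further `__syncthreads()` is needed after the flag has been read and before it is reset for the next iteration, otherwise a fast thread could overwrite the flag while a slower one is still reading it.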

sidxavier · April 27, 2009, 12:28pm

Thanks for your reply, SPWorley. I have already tried something very similar to what you describe. But I was hoping that the voting functions, if used properly, might do the job much faster than this, since they are probably a hardware improvement.

Thanks.

Sid.

Exhodus · March 16, 2010, 8:54pm

Hi

Can I ask a very naive question here? My understanding of CUDA is minimal at best, and I need something like this. My guess is that “mythreadvote” stands for the per-thread condition, like “array[tid]==0”. But isn't writing to the same shared memory variable going to serialize the whole operation? I have to check quite huge arrays for a simple condition, and I'm not even interested in how many conditions come up false; if one does, then I can move on.

Cygnus_X1 · March 17, 2010, 12:45am

Writing into the same bank in shared memory leads to serialisation. There is no “broadcast” for writes; it works only for reads. Therefore using the warp vote functions, if possible, may not be such a bad idea. Here is how I would transform SPWorley’s code:

__shared__ bool blockResult;

if (threadIdx.x==0) blockResult=false;  // one thread resets the flag
__syncthreads();
bool warpResult=__any(yourPredicate);   // true in every lane if any lane's predicate is true
if ((threadIdx.x&31)==0 && warpResult) blockResult=true; // only the first thread of each warp writes
__syncthreads();

Note the parentheses around (threadIdx.x&31). The & operator has lower precedence than the comparison ==.
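On toolkits from CUDA 9 onward, `__any` is deprecated in favour of `__any_sync`, which takes an explicit lane mask, and devices of compute capability 2.0 and later also provide `__syncthreads_or`, which combines the barrier with a block-wide OR. The following is a rough sketch of both variants, not a tuned implementation. It assumes the block size is a multiple of 32 and that every thread reaches the vote; the kernel names, the helper `runBlockAny`, and the "element is non-zero" predicate are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Variant 1: warp vote plus one shared flag (assumes blockDim.x is a multiple of 32).
__global__ void blockAnyVote(const int *data, int *blockAny, int n)
{
    __shared__ bool blockResult;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (data[i] != 0);   // out-of-range threads vote "false"

    if (threadIdx.x == 0) blockResult = false;
    __syncthreads();

    // Every lane of the warp takes part in the vote, hence the full mask.
    bool warpResult = __any_sync(0xffffffffu, pred);
    if ((threadIdx.x & 31) == 0 && warpResult) blockResult = true;
    __syncthreads();

    if (threadIdx.x == 0) blockAny[blockIdx.x] = blockResult;
}

// Variant 2: barrier and block-wide OR in a single call.
__global__ void blockAnyBarrier(const int *data, int *blockAny, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (data[i] != 0);
    int any = __syncthreads_or(pred);       // non-zero in every thread if any vote was true
    if (threadIdx.x == 0) blockAny[blockIdx.x] = (any != 0);
}

// Hypothetical host helper: returns one 0/1 flag per block.
std::vector<int> runBlockAny(const std::vector<int> &h, int threads, bool useBarrier)
{
    int n = (int)h.size();
    int blocks = (n + threads - 1) / threads;
    int *dData, *dAny;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dAny, blocks * sizeof(int));
    cudaMemcpy(dData, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    if (useBarrier) blockAnyBarrier<<<blocks, threads>>>(dData, dAny, n);
    else            blockAnyVote<<<blocks, threads>>>(dData, dAny, n);

    std::vector<int> out(blocks);
    cudaMemcpy(out.data(), dAny, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dAny);
    return out;
}

int main()
{
    std::vector<int> h(512, 0);
    h[300] = 7;   // single non-zero element, located in block 1
    for (int v = 0; v < 2; ++v) {
        std::vector<int> r = runBlockAny(h, 256, v == 1);
        printf("variant %d: block0=%d block1=%d\n", v + 1, r[0], r[1]);
    }
    return 0;
}
```

The second variant needs no shared variable at all, which may make it the simpler choice when the block-wide result is wanted at a point where a barrier is required anyway.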

Exhodus · March 17, 2010, 10:00am

Thank you for the quick answer !

The only thing I don't really understand in the code is the (threadIdx.x&31) part. What does that &31 do?

You7878 · March 25, 2010, 10:44pm

threadIdx.x & 31 is equal to threadIdx.x % 32.
In fact, the compiler optimizes a % b into this kind of mask when b is a power of 2 (for unsigned operands such as threadIdx.x).
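As a quick check of that equivalence, here is a small sketch that compares the two forms for every thread index a block can have. The helper names `laneByMask` and `laneByMod` are made up for illustration, and a warp size of 32 is assumed.

```cuda
#include <cstdio>

// Lane index within a warp, written with a bit mask (warp size assumed to be 32).
__host__ __device__ inline unsigned laneByMask(unsigned tid) { return tid & 31u; }

// The same index written with the remainder operator.
__host__ __device__ inline unsigned laneByMod(unsigned tid) { return tid % 32u; }

int main()
{
    int mismatches = 0;
    for (unsigned tid = 0; tid < 1024; ++tid)   // 1024 = largest block size on current GPUs
        if (laneByMask(tid) != laneByMod(tid)) ++mismatches;
    printf("mismatches: %d\n", mismatches);     // prints "mismatches: 0"
    return 0;
}
```

The two forms agree because threadIdx.x is unsigned. For a negative signed int they differ: -1 % 32 is -1 in C++, while -1 & 31 is 31.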
