A __syncthreads_or(predicate) question: when is predicate evaluated?

Hello,

I have a simple (hopefully not stupid) question about the __syncthreads_or( predicate ) function.

Is predicate evaluated the moment a thread arrives at this command, or does the evaluation wait until all threads of the block have arrived?

I’m thinking about race conditions within my program.

The manual says:

int __syncthreads_or(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.

This gives me no idea of the point in time at which predicate gets evaluated; I guess that the evaluation waits until all threads have arrived.

Thanks for any answer.

– Stefan

If the time of evaluation of predicate matters, then you have a race condition. :) For well-defined behavior, predicate should really not depend on what the other threads are doing.
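To make this concrete, here is a minimal sketch (hypothetical kernel, purely for illustration) of such a race: flag is written by one thread and read by the others inside the predicate, with no barrier in between, so the value each thread passes to __syncthreads_or() is timing-dependent before the barrier is even reached.

__global__ void racy_kernel( int *out )
{
  __shared__ int flag;

  if( threadIdx.x == 0 )
    flag = 0;
  __syncthreads();                // everyone agrees: flag == 0

  if( threadIdx.x == 0 )
    flag = 1;                     // written by one thread ...

  // ... and read by the other threads inside the predicate, with no barrier
  // in between: whether a given thread sees 0 or 1 here depends on timing
  int any = __syncthreads_or( threadIdx.x != 0 && flag == 1 );

  if( threadIdx.x == 0 )
    out[blockIdx.x] = any;
}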

Yes, that’s true. I’m asking because I want to know whether I need an additional __syncthreads() prior to the __syncthreads_or() to avoid the race condition, or whether that is done implicitly by __syncthreads_or().

To give a simple example:

int p_c = threadIdx.x / a; // for some integer a, this gives the same number for a consecutive threads

while( __syncthreads_or( rho_f_n[p_c] > 0.001 ) )
{
  //... some evaluations that lead to

  if( threadIdx.x % a == 0 )
  {
    rho_f_n[p_c] = ... ;
  }

  // do I need an additional __syncthreads here?
}

And: thanks for answering my questions.

Assuming that rho_f_n is in shared memory, I don’t think you need __syncthreads_or() at all:

int p_c = threadIdx.x / a; // for some integer a, this gives the same number for a consecutive threads

while( rho_f_n[p_c] > 0.001 )
{
  //... some evaluations that lead to

  __syncthreads(); // need this one to make sure some threads aren't evaluating the loop condition while others are changing it

  if( threadIdx.x % a == 0 )
  {
    rho_f_n[p_c] = ... ;
  }

  __syncthreads();
}

Although, now that I look at this, you might have a different problem (with either solution). All the threads in a block need to hit the same __syncthreads() call, or the behavior is undefined. If groups of threads in a block can exit the while loop sooner than others, you might end up with a deadlock.
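To illustrate (hypothetical kernel, deliberately broken, just to show the hazard): if p_c differs between threads, the loop condition is a per-group condition, so some groups can leave the loop while others are still waiting at a barrier inside it.

__global__ void deadlock_sketch( const float *rho0, int a )
{
  extern __shared__ float rho_f_n[];

  int p_c = threadIdx.x / a;

  if( threadIdx.x % a == 0 )
    rho_f_n[p_c] = rho0[p_c];     // placeholder initialization
  __syncthreads();

  while( rho_f_n[p_c] > 0.001f )  // groups may exit in different iterations
  {
    __syncthreads();              // reached only by threads still inside the loop

    if( threadIdx.x % a == 0 )
      rho_f_n[p_c] *= 0.5f;       // placeholder for the real update

    __syncthreads();              // hangs as soon as any group has left the loop
  }
}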

Yes, rho_f_n is in shared memory, and possible deadlocks are one of the reasons to use a __syncthreads_or() call: it’s guaranteed that all threads within a block hit the __syncthreads_or(), and depending on its result, all threads within the block will either continue or break out of the loop together.

Something one should know here: if predicate is evaluated the moment a thread arrives at __syncthreads_or() (i.e., if we do have a race condition), my algorithm could at worst do one unnecessary cycle of the while loop, because rho_f_n is guaranteed to be descending and positive (rho_f_n is the estimated error of my numerical algorithm). So even if there is a race condition at my __syncthreads_or(), it would only decrease the efficiency of my program slightly and not change the result (it would only make the result more accurate than desired).

So, the main reason for my question is to better understand the __syncthreads_or() command for future use, and to improve efficiency in this case.

Hm, another problem one could think about is the case where one thread wants to write to rho_f_n[p_c] while another thread wants to read that value at the same time. But since rho_f_n resides in shared memory, and because divergent branches are serialized within an active warp at

if( threadIdx.x % a == 0 )

that case shouldn’t occur.

Actually, the reason I was worried is that it is not clear from this fragment that all threads in the block have the same value of p_c. If they do not, then threads in the same block will be looking at different memory locations for their termination condition and potentially terminate the loop at different times. That would make any use of __syncthreads() cause a deadlock.
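Putting both concerns together, here is a sketch (hypothetical kernel; BLOCK_SIZE, A, and the halving update are placeholders for the real code) of a version that should be safe even when p_c differs between threads: every thread hits the same __syncthreads_or(), and its result is uniform across the block, so all threads continue or leave the loop together. Since predicate is an ordinary function argument, my reading is that each thread evaluates it on arrival, which is why the __syncthreads() at the end of the loop body is kept: it separates the write to rho_f_n from the read in the next evaluation of predicate.

#define BLOCK_SIZE 256
#define A 32                       // "some integer a" from the example

__global__ void iterate_sketch( const float *rho0 )
{
  __shared__ float rho_f_n[BLOCK_SIZE / A];

  int p_c = threadIdx.x / A;       // same value for A consecutive threads

  if( threadIdx.x % A == 0 )
    rho_f_n[p_c] = rho0[p_c];      // placeholder initialization
  __syncthreads();                 // publish the initial error estimates

  while( __syncthreads_or( rho_f_n[p_c] > 0.001f ) )
  {
    //... some evaluations that lead to

    if( threadIdx.x % A == 0 )
      rho_f_n[p_c] *= 0.5f;        // placeholder for the real update

    __syncthreads();               // this extra barrier fences the write off
                                   // from the next read of the predicate
  }
}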