Is there any way to synchronize a warp’s threads instead of all the block’s threads?
Something like a syncwarp() to force all the SIMD elements to reach the same point.
AFAIK a warp is always implicitly synchronized.
Yep, but imagine they follow different IF branches… I need a way to sync them. I could use a __syncthreads(), but that would sync the complete thread block… and I only want to sync the warp’s results.
Er, what? If some threads in a warp execute one basic block and some threads execute another, the entire warp executes both basic blocks; the outputs are simply disabled for the threads that didn’t take each path.
Let me clarify. I need this to perform coherent packing. I have 32 results and I need to make a decision based on the results of all 32 threads. That’s why I need to sync a warp… and I need to do it manually.
A practical example… Imagine I use 192 threads per block. I have, then, 6 groups of 32 threads.
unsigned int warpID = threadIdx.x / WARP_SIZE; // WARP_SIZE is 32 currently
bool hit = testHit(...);
if ( 0 == warpID )
{
  __syncthreads();
  // reduce the hit result of threads [0-31] and take a decision
}
else if ( 1 == warpID )
{
  __syncthreads();
  // reduce the hit result of threads [32-63] and take a decision
}
else if ( 2 == warpID )
{
  __syncthreads();
  // reduce the hit result of threads [64-95] and take a decision
}
// etc etc
I was wondering if there is a more efficient way to do this, skipping all those if ( xxx == warpID ) tests…
Of course, I could use 32 threads per block so that __syncthreads() would be the syncwarp() I want… but 32 threads per block is far from optimal.
thx
Are the warp voting functions added in Compute 1.2 not useful? They might not provide enough granularity, but shared-memory atomics could be an option (unless Compute 1.2 isn’t an option in the first place).
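For reference, on compute capability 1.2+ the vote intrinsics collapse the 32 per-thread predicates of a warp into a single result with no explicit synchronization at all. A minimal sketch, where the input array and the "decision" written per warp are placeholders standing in for the testHit(...) logic from the question:

```cuda
#define WARP_SIZE 32

__global__ void warpVoteKernel(const float *in, int *warpDecision)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Per-thread predicate; stand-in for the user's testHit(...).
    bool hit = in[tid] > 0.0f;

    // __any() is nonzero for every thread of the warp if at least one
    // thread in the warp passed a true predicate (__all() requires
    // all 32). No __syncthreads() is needed: the vote operates on the
    // warp as a unit. Requires compute capability >= 1.2.
    int anyHit = __any(hit);

    // Let one lane per warp record the warp-wide decision.
    if (threadIdx.x % WARP_SIZE == 0)
        warpDecision[tid / WARP_SIZE] = anyHit ? 1 : 0;
}
```

On the 1.0 hardware mentioned below this intrinsic is unavailable, which is why the shared-memory reduction route comes up instead.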
Oh sorry, I forgot to mention it… I’m restricted to compute capability 1.0. The code must run on an old 8800 GTX/Tesla.
Sorry, but I completely fail to understand your problem; are you sure you do? With only 32 threads, __syncthreads() is a no-op: it has no effect whatsoever (except for taking 4 cycles to execute).
I also can’t see the point of your if-then-else construction. Firstly, a “switch (warpID)” would probably be much less ugly, but beyond that I can’t imagine what your “reduce the hit result” step does that you couldn’t just use, e.g., 32*warpID as an offset when accessing some shared array and have no branches at all.
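To sketch that “32*warpID as an offset” idea: on compute 1.0 hardware each warp can reduce its own 32 hit flags in shared memory with no __syncthreads() and no branching on warpID, relying on the implicit lockstep of a warp. This is only a sketch under the thread’s assumptions (192 threads per block, testHit replaced by a placeholder comparison):

```cuda
#define WARP_SIZE         32
#define THREADS_PER_BLOCK 192   // 6 warps, as in the example

__global__ void warpReduceHits(const float *in, int *hitCount)
{
    // One shared slot per thread; each warp only touches its own
    // 32-element window, so warps never interfere with one another.
    __shared__ int hits[THREADS_PER_BLOCK];

    unsigned int tid    = threadIdx.x;
    unsigned int warpID = tid / WARP_SIZE;
    unsigned int lane   = tid % WARP_SIZE;
    int *myWarp = hits + warpID * WARP_SIZE;   // 32*warpID offset

    bool hit = in[blockIdx.x * blockDim.x + tid] > 0.0f; // stand-in for testHit(...)
    myWarp[lane] = hit ? 1 : 0;

    // Tree reduction within the warp. On compute 1.0 the 32 threads of
    // a warp run in lockstep, so no barrier is needed between steps.
    // (On modern GPUs this pattern additionally needs volatile shared
    // memory or __syncwarp() to be correct.)
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2)
        if (lane < offset)
            myWarp[lane] += myWarp[lane + offset];

    // Lane 0 of each warp now holds the warp's hit count and can
    // take the per-warp decision.
    if (lane == 0)
        hitCount[blockIdx.x * (blockDim.x / WARP_SIZE) + warpID] = myWarp[0];
}
```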
This code will hang straightaway… threads in warp 0 will be waiting on one __syncthreads() statement while warp 1’s threads wait on another, and so on, resulting in a kernel hang.
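The usual fix is to keep __syncthreads() out of the divergent branches entirely: every thread in the block reaches one and the same barrier, and only afterwards do you branch on warpID. A hedged sketch of the repaired control flow (testHit is again a placeholder for the question’s per-thread test):

```cuda
#define WARP_SIZE         32
#define THREADS_PER_BLOCK 192

__device__ bool testHit(const float *in, unsigned int i)  // stand-in
{
    return in[i] > 0.0f;
}

__global__ void safeWarpDecisions(const float *in, int *hitCount)
{
    __shared__ int hits[THREADS_PER_BLOCK];

    unsigned int tid    = threadIdx.x;
    unsigned int warpID = tid / WARP_SIZE;

    hits[tid] = testHit(in, blockIdx.x * blockDim.x + tid) ? 1 : 0;

    // ONE barrier, executed unconditionally by every thread in the
    // block. Placing __syncthreads() inside an if(warpID == ...)
    // branch makes different warps wait at different barriers, which
    // is exactly the deadlock described above.
    __syncthreads();

    // Branching on warpID is safe *after* the barrier: each warp's
    // lane 0 reduces that warp's 32 flags and takes its decision.
    if (tid % WARP_SIZE == 0)
    {
        int sum = 0;
        for (int i = 0; i < WARP_SIZE; ++i)
            sum += hits[warpID * WARP_SIZE + i];
        hitCount[blockIdx.x * (blockDim.x / WARP_SIZE) + warpID] = sum;
    }
}
```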