In my opinion, this code should deliver equivalent results:
if (threadIdx.x & 0x01) {
    x = devicememory[42];
} else {
    x = devicememory[4711];
}
It is not defined in which order the two branches will be processed, “so they both can be suspended simultaneous waiting for memory latency”. However, from my understanding the compiler will insert a “point for reconvergence” right after the if-else construct, so there is no need to insert an explicit __syncthreads(); it will happen implicitly. You might now argue that the compiler may do this, but it may also place the “point for reconvergence” elsewhere. But where would the alternative be? It doesn’t make sense to place it anywhere else, does it?
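To make this concrete, here is a sketch of where I assume the reconvergence point would land (the marker comment is pure speculation on my part, and kernel, devicememory and out are just placeholder names):

__global__ void kernel(const int *devicememory, int *out)
{
    int x;
    if (threadIdx.x & 0x01) {
        x = devicememory[42];    // odd lanes take this path
    } else {
        x = devicememory[4711];  // even lanes take this path
    }
    // <-- assumed reconvergence point: all lanes of the warp should
    //     be active again here, without any explicit __syncthreads()
    out[threadIdx.x] = x;
}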
My code example was not intended to be clever; it was intended to prove a point. Of course you can rewrite it so that it no longer proves my point, but I suggest you instead try to understand what I’m trying to explain. Thank you.
Neither was mine; I was also just trying to prove my point by slightly modifying the code structure.
OK, let me try to nail down the points each of us is trying to prove:
You say it is not defined at which point after a branch divergent threads reconverge again; if one needs to be sure of having uniform threads after conditional code, one has to insert an explicit __syncthreads(). (We are considering only threads within the same warp, right?)
Is this correct?
I say the compiler will always insert a point for reconvergence right after conditional code, which acts like a barrier for all threads that executed the preceding conditional code. So there is never a requirement for an explicit sync.
Now I want to prove that if your claim were right, the same code would indeed be executed twice, and not only unused SIMD slots would be lost:
All active threads that reach the branch execute code block 0 uniformly. Now the threads diverge: one part of the threads is about to execute code block 1, the other part is about to execute code block 2. If you were right and the compiler had not inserted a point for reconvergence just before code block 2, but instead before code block 3 or anywhere else in the subsequent code, the scheduler might decide to run the threads that did not take the branch first, i.e. execute code block 2 (but only with the active threads, i.e. the threads that did not take the branch). Then, after code block 2, a point for reconvergence is reached and now the threads that did take the branch are executed: first code block 1, and then code block 2 AGAIN for these threads. The sketch below shows the block structure I have in mind.
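To keep us talking about the same structure, this is roughly how I picture the four code blocks (the condition and the block contents are made up; only the structure matters):

__global__ void sketch(int *data)
{
    // code block 0: executed uniformly by all threads
    int x = data[threadIdx.x];

    if (x > 0) {
        // code block 1: only the threads that took the branch
        x *= 2;
    }
    // code block 2: in my view the point for reconvergence belongs
    // exactly here; if it sat before code block 3 instead, the
    // threads that took the branch would run this block a second time
    x += 1;

    // code block 3: the subsequent code
    data[threadIdx.x] = x;
}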
Do you see that there is no reason at all to place the point of reconvergence anywhere else than right after the conditional code? You wouldn’t be able to do any latency hiding that way. For your code, the compiler would place a point for reconvergence exactly where you placed the explicit __syncthreads(), because it wouldn’t make sense to place it anywhere else. And it has to be placed somewhere if the threads are ever to reconverge again.
What remains to be said is that I am only speculating; I do not know how reconvergence is done, which is why I raised this discussion (thank you for taking part, kuisma). I would appreciate it if someone from NVIDIA could clarify our speculations.
Yes, you understand me correctly. I understand your point too, but I simply think you are wrong. But you are welcome to prove me wrong. :)
I think you overestimate the intelligence of compilers. The undecidability of the halting problem means there is no way to tell how, when, or even if your code block 1 will finish. The same goes for code block 2, of course. Even if you may have a good clue about what your code is supposed to do, do not assume the compiler has one. Convergence checks must therefore be done at run time by the scheduler, or forced by yourself with the __syncthreads() instruction.
Imagine code block 1 takes one hour of execution time for the threads fulfilling the if-condition, but those same threads do not perform any computation at all during code block 2.
Then imagine that the threads NOT fulfilling the condition take one hour to execute code block 2.
Do you see the benefit of a divergent warp here? This way, latencies caused by memory accesses can be hidden by parallel execution. In your case, the execution time would be twice(*) the optimal due to forced serial execution of a perfectly good parallel task.
Edit: (*) assuming the major part of the execution time is due to memory latency
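A sketch of the scenario I mean (the array names, loop lengths and stride are made up; each loop simply stands in for an hour of latency-bound work):

__global__ void divergent(const float *a, const float *b, float *out)
{
    float x = 0.0f;
    if (threadIdx.x & 0x01) {
        // code block 1: the odd lanes issue a long chain of
        // global-memory loads (almost pure latency, little compute)
        for (int i = 0; i < 1000; ++i)
            x += a[threadIdx.x + 32 * i];
    }
    // code block 2: only the even lanes have real work left here,
    // again a long chain of latency-bound loads
    if (!(threadIdx.x & 0x01)) {
        for (int i = 0; i < 1000; ++i)
            x += b[threadIdx.x + 32 * i];
    }
    out[threadIdx.x] = x;
}

If the scheduler is free to keep both paths in flight, the load latencies of the two blocks can overlap; a point for reconvergence forced between them serializes the two chains.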
My assumption is that it is not up to the scheduler to determine when and where divergent threads converge again, because of this quotation from the PTX guide:
If I understand this quotation correctly, points of reconvergence are determined at compile time and the scheduler will schedule the threads according to these points. What do you think?
My understanding of CUDA threads within a half-warp is that the SIMD units can really only execute one and the same instruction (Single Instruction, Multiple Data) in any given clock cycle. When threads diverge depending on some condition and execution is serialized, all SIMD units are still basically trying to execute the same instruction, but the units that are not part of the active divergent branch are simply disabled.
All of this is handled automatically by the hardware at runtime, so there is no need for the compiler to insert any explicit “point for reconvergence” after a conditional. The __syncthreads() call is only needed when synchronizing execution between threads in different warps (of the same block). Threads within a half-warp are always in sync, but some of them might be disabled.
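A sketch of what I mean, assuming 32-thread warps and a block of 64 threads (the kernel name and buffers are made up):

__global__ void crossWarp(int *out)
{
    __shared__ int s[64];

    // divergent write within each warp: no explicit sync needed,
    // the hardware just disables the inactive lanes per instruction
    if (threadIdx.x & 0x01)
        s[threadIdx.x] = 1;
    else
        s[threadIdx.x] = 2;

    // here a barrier IS needed: each thread is about to read a
    // value written by a thread in the other warp of the block
    __syncthreads();

    out[threadIdx.x] = s[63 - threadIdx.x];
}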
That’s my current interpretation… Am I totally off track?