In my opinion, this code should deliver equivalent results:
if (threadIdx.x & 0x01) {
    x = devicememory[42];
} else {
    x = devicememory[4711];
}
It is not defined in which order the two branches will be processed, “so they both can be suspended simultaneous waiting for memory latency”. However, from my understanding the compiler will insert a “point for reconvergence” right after the if-else construct, so there is no need to insert an explicit __syncthreads(); it will happen implicitly. You might now argue that the compiler may do this, but it may also place the “point for reconvergence” elsewhere. But where would the alternative be? It doesn’t make sense to place it anywhere else, does it?
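To make this concrete, here is a sketch of where I assume the reconvergence point would land (the marker comment is pure speculation on my part, and kernel, devicememory and out are just placeholder names):

__global__ void kernel(const int *devicememory, int *out)
{
    int x;
    if (threadIdx.x & 0x01) {
        x = devicememory[42];    // odd lanes take this path
    } else {
        x = devicememory[4711];  // even lanes take this path
    }
    // <-- assumed reconvergence point: all lanes of the warp should
    //     be active again here, without any explicit __syncthreads()
    out[threadIdx.x] = x;
}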
My code example was not intended to be clever; it was intended to prove a point. Of course you can rewrite it so that it no longer proves my point, but I suggest you instead try to understand what I’m trying to explain. Thank you.
Neither was mine; I was also just trying to prove my point by slightly modifying the code structure.
OK, let me try to nail down the points each of us is trying to prove:
You say it is not defined at which point after a branch divergent threads reconverge again; if one needs to be sure of having uniform threads after conditional code, one has to insert an explicit __syncthreads(). (We are considering only threads within the same warp, right?)
Is this correct?
I say the compiler will always insert a point for reconvergence right after conditional code, which acts like a barrier for all threads that executed the preceding conditional code. So there is never a requirement for an explicit sync.
Now I want to prove that if your claim were right, the same code would indeed be executed twice, and not only unused SIMD slots would be lost:
All active threads that reach the branch execute code block 0 uniformly. Now the threads diverge: one part of the threads is about to execute code block 1, the other part is about to execute code block 2. If you were right and the compiler had not inserted a point for reconvergence just before code block 2, but instead before code block 3 or anywhere else in the subsequent code, the scheduler might decide to run the threads that did not take the branch first, i.e. execute code block 2 (but only with the active threads, i.e. the threads that did not take the branch). Then, after code block 2, a point for reconvergence is reached and now the threads that did take the branch are executed: first code block 1, and then code block 2 AGAIN for these threads. The sketch below shows the block structure I have in mind.
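To keep us talking about the same structure, this is roughly how I picture the four code blocks (the condition and the block contents are made up; only the structure matters):

__global__ void sketch(int *data)
{
    // code block 0: executed uniformly by all threads
    int x = data[threadIdx.x];

    if (x > 0) {
        // code block 1: only the threads that took the branch
        x *= 2;
    }
    // code block 2: in my view the point for reconvergence belongs
    // exactly here; if it sat before code block 3 instead, the
    // threads that took the branch would run this block a second time
    x += 1;

    // code block 3: the subsequent code
    data[threadIdx.x] = x;
}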
Do you see that there is no reason at all to place the point of reconvergence anywhere else than right after the conditional code? You wouldn’t be able to do any latency hiding that way. For your code, the compiler would place a point for reconvergence exactly where you placed the explicit __syncthreads(), because it wouldn’t make sense to place it anywhere else. And it has to be placed somewhere if the threads are ever to reconverge again.
What remains to be said is that I am only speculating; I do not know how reconvergence is done, which is why I raised this discussion (thank you for taking part, kuisma). I would appreciate it if someone from NVIDIA could clarify our speculations.
Yes, you understand me correctly. I understand your point too, but I simply think you are wrong. But you are welcome to prove me wrong. :)
I think you overestimate the intelligence of compilers. The undecidability of the halting problem means there is no way to tell how, when, or even if your code block 1 will finish. The same goes for code block 2, of course. Even if you may have a good clue about what your code is supposed to do, do not assume the compiler has one. Convergence checks must therefore be done at run time by the scheduler, or forced by yourself with the __syncthreads() instruction.
Imagine code block 1 takes one hour of execution time for the threads fulfilling the if-condition, but those same threads do not perform any computation at all during code block 2.
Then imagine that the threads NOT fulfilling the condition take one hour to execute code block 2.
Do you see the benefit of a divergent warp here? This way, latencies caused by memory accesses can be hidden by parallel execution. In your case, the execution time would be twice(*) the optimal due to forced serial execution of a perfectly good parallel task.
Edit: (*) assuming the major part of the execution time is due to memory latency
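A sketch of the scenario I mean (the array names, loop lengths and stride are made up; each loop simply stands in for an hour of latency-bound work):

__global__ void divergent(const float *a, const float *b, float *out)
{
    float x = 0.0f;
    if (threadIdx.x & 0x01) {
        // code block 1: the odd lanes issue a long chain of
        // global-memory loads (almost pure latency, little compute)
        for (int i = 0; i < 1000; ++i)
            x += a[threadIdx.x + 32 * i];
    }
    // code block 2: only the even lanes have real work left here,
    // again a long chain of latency-bound loads
    if (!(threadIdx.x & 0x01)) {
        for (int i = 0; i < 1000; ++i)
            x += b[threadIdx.x + 32 * i];
    }
    out[threadIdx.x] = x;
}

If the scheduler is free to keep both paths in flight, the load latencies of the two blocks can overlap; a point for reconvergence forced between them serializes the two chains.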
My assumption is that it is not up to the scheduler to determine when and where divergent threads converge again, because of this quotation from the PTX guide:
If I understand this quotation correctly, points of reconvergence are determined at compile time and the scheduler will schedule the threads according to these points. What do you think?
My understanding of CUDA threads within a half-warp is that the SIMD units can really only execute one and the same instruction (Single Instruction, Multiple Data) in any given clock cycle. When threads diverge depending on some condition and execution is serialized, all SIMD units are still basically trying to execute the same instruction, but the units that are not part of the active divergent branch are simply disabled.
All of this is handled automatically by the hardware at runtime, so there is no need for the compiler to insert any explicit “point for reconvergence” after a conditional. The __syncthreads() call is only needed when synchronizing execution between threads in different warps (of the same block). Threads within a half-warp are always in sync, but some of them might be disabled.
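A sketch of what I mean, assuming 32-thread warps and a block of 64 threads (the kernel name and buffers are made up):

__global__ void crossWarp(int *out)
{
    __shared__ int s[64];

    // divergent write within each warp: no explicit sync needed,
    // the hardware just disables the inactive lanes per instruction
    if (threadIdx.x & 0x01)
        s[threadIdx.x] = 1;
    else
        s[threadIdx.x] = 2;

    // here a barrier IS needed: each thread is about to read a
    // value written by a thread in the other warp of the block
    __syncthreads();

    out[threadIdx.x] = s[63 - threadIdx.x];
}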
That’s my current interpretation… Am I totally off track?