I’m surprised. Does CUDA by Example really have code that fails in such an obvious way?
while( atomicCAS(&mutex, 0, 1) != 0);
is a straight deadlock in CUDA whenever threads of the same warp contend for the lock. At most one thread can grab the lock; all the others have to spin in the loop. However, since all threads of a warp execute in lockstep, the thread that owns the lock cannot proceed to release it until the other threads leave the loop as well, which never happens.
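For reference, here is a minimal sketch of the restructuring that is commonly suggested for this situation: acquire, critical section, and release all sit inside the same branch, so the thread that wins the lock does not have to wait for its warp-mates to leave a spin loop first. The names mutex, incrementWithLock and counter are made up for illustration and are not from the book. Whether this actually avoids the hang still depends on how the compiler lays out the branches, so treat it as an illustration of the idea rather than a guaranteed fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical global lock for this sketch: 0 = free, 1 = held.
__device__ int mutex = 0;

// Each thread adds 1 to *counter while holding the lock.
// volatile keeps the compiler from caching *counter in a register.
__global__ void incrementWithLock(volatile int *counter)
{
    bool done = false;
    while (!done) {
        // Try once to take the lock; threads that lose simply retry on the
        // next loop iteration instead of spinning on the atomicCAS itself.
        if (atomicCAS(&mutex, 0, 1) == 0) {
            *counter = *counter + 1;  // critical section
            __threadfence();          // publish the update before releasing
            atomicExch(&mutex, 0);    // release inside the same branch
            done = true;
        }
    }
}

int main()
{
    const int blocks = 4, threads = 64;  // arbitrary sizes for the sketch
    int hCounter = 0;
    int *dCounter = nullptr;

    cudaMalloc(&dCounter, sizeof(int));
    cudaMemcpy(dCounter, &hCounter, sizeof(int), cudaMemcpyHostToDevice);

    incrementWithLock<<<blocks, threads>>>(dCounter);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(&hCounter, dCounter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("status: %s, counter = %d, threads launched = %d\n",
           cudaGetErrorString(err), hCounter, blocks * threads);

    cudaFree(dCounter);
    return err == cudaSuccess ? 0 : 1;
}
```

If the lock behaves as intended, the counter should match the number of threads launched; a hang here would point at the same branch-ordering issue described above.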
I have examined some simple cases for Fermi. It is actually pretty well defined - the first BRA always takes immediate effect (whenever there is a BRA, the branch indicated by that BRA is executed first). The problem with the current compiler (EDIT: not current, it’s from the 4.0 toolkit) is that the first BRA leads to the branch that attempts to lock again, not the branch that unlocks. Inserting an extra BRA fixes this. Of course, that extra BRA may not be desirable in all cases.
Oh, I understand you are very familiar with BRAs. But my point is that “programmers should not make assumptions about which BRA the hardware will select at run time”. The hardware may change its choice in the future, and you should not speculate on it as a programmer…
I’m sorry if I said something stupid, but let’s not start a war over this. I was only trying to give some information :) You’re totally right that programmers shouldn’t have to be concerned with this. The compiler guys should have got it right in the first place. However, I’m not sure whether this can be considered a bug. Perhaps the compiler guys made a conscious choice for other reasons.