I have a theory as to why the deadlock happens, and it has to do with divergent warps.
But first, I need to correct a mistake in my previous atomicAdd() example. While investigating, I discovered the reason it doesn't lock up: atomicAdd() does not accept volatile arguments, so I had to change green to be non-volatile. Making it non-volatile totally messed up the functionality, so it did not actually protect the critical section at all. Oops!
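For the record, the broken version looked roughly like this (a reconstruction, not my exact code; the kernel name is made up). Without volatile, the compiler is free to assume green never changes inside the loop, so the spin-wait no longer observes other threads' updates and threads can fall through without actually waiting:

```
// Approximate reconstruction of the earlier atomicAdd() attempt
__device__ int green = 0;  // volatile removed so atomicAdd() accepts &green

__global__ void atomickernel(void) {
    // BUG: green is not volatile, so this read can be cached in a
    // register and the loop never sees other threads' increments
    while (green != threadIdx.x)
        ;
    // "critical section" -- no longer actually protected
    atomicAdd(&green, 1);
}
```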
Ok. Now for the lock-up problem.
Consider:
if (condition) {
a();
}
else {
b();
}
When branching, the processor has a choice between executing the true branch first, the false branch first, or executing both and time-slicing between the divergent branches. My thinking, based on the observed behavior, is that it does not time-slice between them, but instead executes them sequentially. Looking at the decuda output, there are "join" points which appear to end one branch and transfer control to the other branch so it can catch up.
So for example, the assembly code looks like this:
1. calculate predicate for condition
2. predicated branch to TrueLabel (begin divergence)
3. do stuff for b()
4. goto JoinLabel
5. TrueLabel:
6. do stuff for a()
7. JoinLabel:
8. join
When executing divergent code, the processor will go through steps 1 and 2, and will start computing at step 5, knowing that it has to come back to step 3 for those threads for which the condition was false. It will go through steps 6, 7, and 8, and when it reaches the join, it backtracks to 3, then 4, then 8 again. Since all the threads have now reached step 8, execution is no longer divergent and processing continues.
But what if “do stuff for a()” is in an infinite loop waiting for “do stuff for b()”? It’s a deadlock type of situation!
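As a concrete illustration of that situation (a hypothetical sketch; the kernel and variable names are mine, and it assumes the true branch really does run first, which depends on how the compiler lays out the branches):

```
__device__ volatile int flag = 0;

__global__ void divergence_deadlock(void) {
    if (threadIdx.x == 0) {
        // a(): spins waiting for a value that only b() produces
        while (flag == 0)
            ;
    } else {
        // b(): would release a(), but if b()'s fork is deferred until
        // a() reaches the join, this line never executes
        flag = 1;
    }
}
```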
Now consider this:
1. load "green" from global memory
2. calculate predicate for green != threadIdx.x
3. predicated branch to step 1
4. add 1 to value of green
5. store to "green" in global memory
6. join
My theory is that execution goes through steps 1, 2, and 3; the "true" fork then gets executed, and the "false" fork gets deferred until the true fork finishes (reaches the join). The true fork loops back to step 1 and gets stuck cycling through steps 1, 2, 3 forever. It's deadlocked waiting for the false fork, and the false fork is waiting for the true fork to reach a join.
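Written back out as source, the kernel those steps correspond to would be something like this (a sketch; the kernel name is mine):

```
__device__ volatile int green = 0;

__global__ void spinkernel(void) {
    // Steps 1-3: the "true" fork spins here forever, because the
    // "false" fork (the thread whose turn it is) is deferred until
    // the true fork reaches the join -- which never happens
    while (green != threadIdx.x)
        ;
    // Steps 4-5: never reached
    ++green;
}
```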
Hypothesizing that the true fork gets executed first, I tried this, and it does not hang:
__device__ int volatile green = 0;

__global__ void hangkernel(void) {
    int executed = 0;
    while (!executed) {
        if (green == threadIdx.x) {
            // critical section
            ++green;
            executed = 1;
        }
    }
}
The lesson to learn from this is that mutexes within a warp are very dangerous and can deadlock easily.
I always understood why multi-block barriers are hazardous, and now I understand why tmurray said that mutexes are dangerous too! It is certainly not a good idea to depend on the order of the divergence, because it can change easily depending on how the compiler decides to implement the logic. That can lead to silent problems that lurk until mutual exclusion is actually contended, and then the program hangs with deadlock. Ouch.
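For what it's worth, the same if-inside-a-while structure can be combined with atomicCAS() to build a lock that also works across blocks. This is only a sketch (the names are mine), and it relies on the same property as the working example above: the thread that takes the lock also releases it inside the same branch, before any join point, so no thread is left waiting on a deferred fork:

```
__device__ int mutex = 0;  // 0 = unlocked, 1 = locked

__global__ void cas_mutex_kernel(void) {
    bool done = false;
    while (!done) {
        if (atomicCAS(&mutex, 0, 1) == 0) {  // try to acquire
            // critical section
            __threadfence();         // flush writes before releasing
            atomicExch(&mutex, 0);   // release inside the same branch
            done = true;
        }
    }
}
```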