Hi,
I am using CUDA 4.0 on a Tesla C1060 GPU. In my GPU kernel code I have one block, block1, that waits for another block, block2, to finish its task by spinning on a synchronisation variable declared in global memory, d_sync.
The variable is initialised to 0 before the kernel is invoked, and its value is set to 1 only by block2 (when it is done with its task).
block1 synchronises using the following code:
if (threadIdx.x == 0 && blockIdx.x == 0)
{
    while (d_sync != 1);
}
__syncthreads();
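For completeness, the signalling side in block2 looks roughly like this. This is a simplified sketch rather than my exact code; the block index check and the __threadfence() placement are just how I have understood it should be done:
if (blockIdx.x == 1)                  // illustrative: block2
{
    // ... block2's task ...
    __syncthreads();
    if (threadIdx.x == 0)
    {
        __threadfence();              // make block2's prior writes visible to other blocks
        d_sync = 1;                   // signal completion to block1
    }
}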
I had forgotten to declare d_sync as volatile.
Whenever I ran this code, I found that block1 never waited for block2 to finish its task, even when the compiler optimization flag was set to -O0.
But when I compiled in debug mode (with the -g -G flags), block1 did wait for block2 to finish its task.
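For reference, the two builds were along these lines (the file name and the -arch=sm_13 for the C1060 are just my placeholders here):
nvcc -O0 -arch=sm_13 -o test test.cu      (optimization flag set to -O0)
nvcc -g -G -arch=sm_13 -o test test.cu    (debug mode)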
As per my understanding, this is what is happening:
If there are no optimizations: the code should work as intended, i.e. block1 will exit the while loop when it sees the new value assigned to d_sync by block2, and that is why the code runs fine when compiled with the -g -G flags.
If there are optimizations: the compiler sees that d_sync is not volatile, so its value can be cached in a register, say reg1. It reads d_sync once and puts the value into reg1, which would lead to code like this:
if (threadIdx.x == 0 && blockIdx.x == 0)
{
    reg1 = d_sync;
    while (reg1 != 1);
}
__syncthreads();
Now it might realise that nothing modifies reg1 inside the loop, so there is no point in checking its value again and again, which should lead to something of this sort:
if (threadIdx.x == 0 && blockIdx.x == 0)
{
    reg1 = d_sync;
    if (reg1 != 1)
    {
        while (1);
    }
}
__syncthreads();
But instead, block1 always went through without waiting for block2 to set the value.
Even when I removed the piece of code in block2 that sets d_sync from 0 to 1, block1 still went through the loop.
Whereas, in this case, I would expect block1 to block indefinitely.
I understand that I made a mistake by not declaring d_sync as volatile, which could have caused block1 to never see the change in the value of d_sync made by block2 and hence to wait indefinitely. However, the optimized code behaved in the completely opposite manner, going through successfully every time. This was a big reason why I never suspected that the problem was caused by d_sync not being declared volatile.
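For completeness, my understanding of the fix is simply to declare the flag volatile, so that the compiler issues a real load from global memory on every iteration of the loop, roughly like this (assuming the flag is a __device__ variable; if it is passed in as a pointer, the pointee would be declared volatile instead):
__device__ volatile int d_sync;        // volatile forces a fresh read of the flag each time

if (threadIdx.x == 0 && blockIdx.x == 0)
{
    while (d_sync != 1);               // now re-reads d_sync from global memory every iteration
}
__syncthreads();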
I believe this is incorrect behaviour by the compiler. In case you feel that this optimization by the compiler is correct, I would be really interested in knowing the possible reason that I might have overlooked.
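In case it helps, here is roughly what a self-contained reproduction of the scenario would look like. This is a simplified sketch, not my actual code: the kernel name, variable names, launch configuration and the __threadfence() call are all illustrative, and d_sync is deliberately left non-volatile so as to reproduce the behaviour described above.
#include <cstdio>
#include <cuda_runtime.h>

__device__ int d_sync;                       // deliberately NOT volatile, as in my original code

__global__ void test_kernel(int *d_out)
{
    if (blockIdx.x == 1)                     // "block2": does its task, then raises the flag
    {
        // ... block2's task ...
        __syncthreads();
        if (threadIdx.x == 0)
        {
            __threadfence();                 // make prior writes visible before signalling
            d_sync = 1;
        }
    }
    else if (blockIdx.x == 0)                // "block1": waits for the flag
    {
        if (threadIdx.x == 0)
        {
            while (d_sync != 1);             // expected to spin until block2 sets d_sync
        }
        __syncthreads();
        if (threadIdx.x == 0)
            *d_out = 1;                      // reached only after the wait completes
    }
}

int main()
{
    int zero = 0, result = -1;
    int *d_out;
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_out, &result, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_sync, &zero, sizeof(int));   // initialise the flag to 0 before launch
    test_kernel<<<2, 32>>>(d_out);
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("result = %d\n", result);                  // 1 means block1 got past the wait
    cudaFree(d_out);
    return 0;
}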