Compiler optimisation affecting code correctness

Hi,

I am using CUDA 4.0 on a Tesla C1060 GPU. In my kernel code, one block, block1, waits for another block, block2, to finish its task by spinning on a synchronisation variable declared in global memory, d_sync.

The variable is initialised to 0 before the kernel is invoked, and its value is set to 1 only by block2 (when it is done with its task).
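For reference, the host-side setup looks roughly like this (a simplified sketch; the kernel body and launch configuration here are placeholders):

#include <cuda_runtime.h>

// Synchronisation flag in global memory: 0 = block2 not done, 1 = done.
__device__ int d_sync;

__global__ void myKernel(void)
{
    // ... block1 waits on d_sync, block2 eventually sets it to 1 ...
}

int main(void)
{
    int zero = 0;
    // Initialise the flag to 0 before launching the kernel.
    cudaMemcpyToSymbol(d_sync, &zero, sizeof(int));
    myKernel<<<2, 256>>>();
    cudaDeviceSynchronize();
    return 0;
}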

block1 synchronises using the following code:

if (threadIdx.x == 0 && blockIdx.x == 0)
{
    while (d_sync != 1);
}
__syncthreads();

I had forgotten to declare d_sync as volatile.
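In other words, what I had intended to write was something like this (sketch; the kernel name is just a placeholder):

// volatile forces every read of d_sync to go back to global memory
// instead of being cached in a register.
__device__ volatile int d_sync;

__global__ void waiter(void)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        while (d_sync != 1);   // genuinely re-reads d_sync on every iteration
    }
    __syncthreads();
}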

Whenever I ran this code, I found that block1 never waited for block2 to finish its task, even with the compiler optimization flag set to -O0.

But when I compiled in debug mode (with the -g -G flags), block1 did wait for block2 to finish its task.

As per my understanding, this is what is happening:

If there are no optimizations: the code should work as intended, i.e. block1 will leave the while loop when it sees the new value assigned to d_sync by block2, and that's why the code runs fine when compiled with the -g -G flags.

If there are optimizations: the compiler sees that d_sync is not volatile, so it can be cached in a register, say reg1. It reads d_sync once and keeps it in reg1, which would have led to this code:

if (threadIdx.x == 0 && blockIdx.x == 0)
{
    reg1 = d_sync;
    while (reg1 != 1);
}
__syncthreads();

The compiler might then realise that nothing modifies reg1, so there is no point in checking its value repeatedly, which should have led to something like this:

if (threadIdx.x == 0 && blockIdx.x == 0)
{
    reg1 = d_sync;
    if (reg1 != 1)
    {
        while (1);
    }
}
__syncthreads();

But instead, block1 always fell through without waiting for block2 to set the value.

Even when I removed the code in block2 that set d_sync from 0 to 1, block1 still passed through the loop.

In that case, I would expect block1 to be blocked indefinitely.

I understand that I made a mistake by not declaring d_sync as volatile, which could have caused block1 never to see block2's update to d_sync and therefore to wait indefinitely. But the optimized code behaved in exactly the opposite manner, falling through successfully every time. That is the main reason I never suspected the missing volatile declaration.

I believe this is incorrect behaviour by the compiler. If you feel the optimization is correct, I would be very interested to know what reason I might have overlooked.

If I understand correctly, you are trying to set up communication between different thread blocks within the same kernel launch. If so, this is not supported by the CUDA programming model. All thread blocks execute completely independently of each other, in no particular order.
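If you need all of block2's work to finish before block1's work starts, the supported approach is to split the work into two kernel launches; launches issued to the same stream are guaranteed to execute in order. Roughly (kernel names and launch sizes here are placeholders):

#include <cuda_runtime.h>

__global__ void produce(int *data)       { /* block2's work */ }
__global__ void consume(const int *data) { /* block1's work */ }

int main(void)
{
    int *d_data;
    cudaMalloc(&d_data, 256 * sizeof(int));
    // Launches on the same (default) stream run in order: every block
    // of produce() completes before any block of consume() starts.
    produce<<<2, 256>>>(d_data);
    consume<<<2, 256>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}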

OK, so thread blocks are executed out of order.

But still, when I completely removed the code that set d_sync to 1, why didn't block1 go into an infinite wait?

Because a Tesla card is like a Cray under your desk, and a Cray executes an infinite loop in under 2 seconds. :smile:

Seriously, the infinite loop is empty, so the compiler probably just optimizes it away.

But by removing the loop, the compiler causes subsequent code to be executed that should never have been executed.
So the compiler is changing the observable behaviour (and result) of the program.
This should not happen as a result of optimization.

I’d disagree.
The C standard deliberately makes no guarantees with respect to timing, so you cannot rely on some part of the code taking so long to execute that subsequent code is never reached. Moreover, a loop whose controlling expression is not a constant expression and whose body has no side effects may be assumed to terminate (C11 and C++11 state this explicitly), so the compiler is entitled to delete the empty wait loop altogether.
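To illustrate, here is a sketch of the transformation the optimizer is entitled to make (illustrative only, not actual compiler output):

__device__ int d_sync;   // non-volatile

__global__ void waiter(void)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        // Source:   while (d_sync != 1);
        // The loop body has no side effects, so the compiler may assume
        // it terminates and remove it entirely; at most a single dead
        // read of d_sync remains, and block1 falls straight through.
    }
    __syncthreads();
}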