Hiding global memory access: do I need 2 warps?

I am trying to hide a time-consuming global-memory atomic operation behind some independent computation.

Consider the following code:

__shared__ int shVar;

if (threadIdx.x == 0)
    shVar = atomicAdd(&global, 1);

[...]
// do some computation that does not read from global memory and does not depend on shVar
[...]

int something = globalArray[shVar];

[...]

If I assign a single warp to execute the above, will it stall at the atomicAdd until it returns, or is the compiler/driver/hardware smart enough to detect that it actually has to wait only later, at the first reference to shVar?

More generally, are all global memory operations completely synchronous, or can a warp continue its execution for a while when it is able to?

One could easily replace the above example with:

__shared__ int shVar;

if (threadIdx.x == 32)
    shVar = atomicAdd(&global, 1);

if (threadIdx.x < 32) {
    [...]
    // do some computation that does not read from global memory and does not depend on shVar
    [...]
}

__syncthreads();

int something = globalArray[shVar];

[...]

… and assign 2 warps to do it, but since we do not know in which order warps are executed, I have no guarantee that the atomic operation will be issued before my computation. In the worst-case scenario, warp 0 could reach the __syncthreads() barrier while warp 1 has not even started the atomic operation!

Or maybe there is third solution?
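For reference, here is a hedged sketch of one possible third approach (my own assumption, not confirmed anywhere in this thread): keep the atomic's return value in a register instead of shared memory, and broadcast it across the warp with a shuffle only after the independent work. The idea is that the only instruction that truly depends on the atomic's result is the shuffle, so the hardware's dependency tracking can, at best, defer the stall until that point. The kernel below assumes a single warp per block, CUDA 9 or later for `__shfl_sync`, and hypothetical names (`global`, `globalArray`, `out`).

```cuda
#include <cuda_runtime.h>

// Sketch only: one warp per block, ticket obtained by lane 0 via an atomic.
__global__ void ticketKernel(int *global, const int *globalArray, int *out)
{
    int ticket = 0;
    if (threadIdx.x == 0)
        ticket = atomicAdd(global, 1);   // long-latency atomic; result stays in a register

    // ... independent computation that does not read global memory
    //     and does not depend on the atomic's result ...

    // Broadcast lane 0's ticket to the whole warp. This is the first real
    // use of the atomic's return value, so any stall happens here at the
    // earliest, rather than at the atomicAdd itself.
    ticket = __shfl_sync(0xffffffffu, ticket, 0);

    out[threadIdx.x] = globalArray[ticket];
}
```

Whether this actually overlaps the atomic with the computation depends on how the compiler schedules the instructions; it only avoids the shared-memory store that the original version makes immediately dependent on the atomic's result.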

I also wanted to do something similar. However, it did not give the expected results. Maybe the compiler (or CUDA itself) automatically optimizes the global memory access so that the data is fetched beforehand.

Tricks of this kind usually do not give the expected results, probably due to optimizations you cannot see.