Device memory r/w operations

Is it documented whether operations on device memory (reads and writes) are asynchronous, regardless of whether that memory is accessed through the global or the texture space?

Consider the following code:

texture<float, 1, cudaReadModeElementType> data;

__global__ void opsGPU(float* d_out, int outSize)
{
    const int blockStart = IMUL(blockIdx.x, THREADS_PER_BLOCK);
    const int writePos   = blockStart + threadIdx.x;

    if (writePos < outSize)
    {
        float in = tex1Dfetch(data, writePos);

        // ... some operations that do not require the fetch (ops1) ...
        // ... some operations that do require the fetch (ops2) ...

        d_out[writePos] = some_result;   // some_result is produced by ops2
    }
}

Will execution return to the thread so that ops1 can run, with the thread stalling on the fetch (and potentially being swapped out for a different one) only when the fetched value is actually needed? Or will both ops1 and ops2 have to wait for the fetch? And could the compiler apply an optimization that moves ops1 above the fetch?
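To make the reordering question concrete, the transformation I have in mind would look roughly like the sketch below. The ops1/ops2 device functions are made-up stand-ins just so the example compiles; the actual operations are not the point.

__device__ float ops1()                   { return 2.0f; }   // independent of the fetch
__device__ float ops2(float in, float x)  { return in * x; } // consumes the fetched value

__global__ void opsGPU_reordered(float* d_out, int outSize)
{
    const int writePos = IMUL(blockIdx.x, THREADS_PER_BLOCK) + threadIdx.x;

    if (writePos < outSize)
    {
        // ops1 moved above the fetch in source order: if the load is
        // asynchronous, this arithmetic could overlap the fetch latency.
        const float x = ops1();

        const float in = tex1Dfetch(data, writePos);

        d_out[writePos] = ops2(in, x);   // first use of the fetched value
    }
}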

Along the same lines, is it reasonable to write:

d_out[threadPos] = d_in1[threadPos] - d_in2[threadPos];

or

float in1 = d_in1[threadPos];
float in2 = d_in2[threadPos];
d_out[threadPos] = in1 - in2;

or does it all reduce to the same code during compilation?
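For reference, here is a complete kernel built around the second variant; the kernel name, parameter list, and bounds check are my own framing, not from any particular codebase:

__global__ void subKernel(float* d_out, const float* d_in1,
                          const float* d_in2, int n)
{
    const int threadPos = IMUL(blockIdx.x, THREADS_PER_BLOCK) + threadIdx.x;

    if (threadPos < n)
    {
        // Variant 2: explicit loads into registers, then the subtraction.
        const float in1 = d_in1[threadPos];
        const float in2 = d_in2[threadPos];
        d_out[threadPos] = in1 - in2;

        // Variant 1 would simply inline the loads:
        // d_out[threadPos] = d_in1[threadPos] - d_in2[threadPos];
    }
}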

thanks for any replies,

Konstantinos