__syncthreads() and global memory

I have written a small kernel to sum 2^k elements using parallel reduction. Nothing new here. My vector is stored in global memory; I assign each part of the vector to a different block and reduce each block to a single position. The rest I do on the CPU.

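__global__ void sum(real *v, long int s){   // s is unused in this kernel

    long int ix     =  threadIdx.x;            // index within the block
    long int shift = blockIdx.x*blockDim.x;    // start of this block's part of v

    long int h = blockDim.x/2;                 // number of active threads in this step
    while (h >= 1){
        if (ix < h){
            // pairwise sum, written back in place into the lower half
            v[ix +  shift] = v[2*ix + shift] + v[2*ix + 1 + shift];
        }
        __syncthreads();                       // one barrier per step
        h = h / 2;
    }
}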

The code works. However, after careful inspection, I realized that maybe it should not work, so I am confused. It could be that the thread with thread_id = 1, which sums elements 2 and 3, writes its sum to position 1 before the thread with thread_id = 0 is able to read elements 0 and 1, thus making the result invalid.
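To make the worry concrete, here is a minimal host-side sketch that replays a single reduction step sequentially, in a thread order I choose. The helper name replay_step and the use of float are assumptions for illustration only; this is not how the GPU schedules threads, it only shows what would happen if thread 1 happened to finish before thread 0 started.

```cuda
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the kernel): replays ONE step of the
// in-place reduction on the host. The "threads" listed in `order` run one
// after another, each doing v[ix] = v[2*ix] + v[2*ix + 1].
std::vector<float> replay_step(std::vector<float> v, const std::vector<int>& order) {
    for (int ix : order) {
        assert(2 * ix + 1 < static_cast<int>(v.size()));  // stay inside the vector
        v[ix] = v[2 * ix] + v[2 * ix + 1];
    }
    return v;
}
```

With v = {1, 2, 3, 4}, the order {0, 1} leaves 3 and 7 in positions 0 and 1, as intended. The order {1, 0} first overwrites position 1 with 7, so thread 0 then computes 1 + 7 = 8 instead of 3.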

I would have assumed that, to be safe, the code would have to be

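__global__ void sumsafe(real *v, long int s){   // s is unused in this kernel
    long int ix     =  threadIdx.x;            // index within the block
    long int shift = blockIdx.x*blockDim.x;    // start of this block's part of v
    real x = 0;                                // per-thread temporary for the pair sum
    long int h = blockDim.x/2;
    while (h >= 1){
        if (ix < h){
            x = v[2*ix + shift] + v[2*ix + 1 + shift];   // read phase
        }
        __syncthreads();                       // all reads finish before any write
        if (ix < h){
            v[ix +  shift] = x;                // write phase
        }
        __syncthreads();                       // all writes finish before the next step
        h = h / 2;
    }
}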

so that I guarantee that all the threads read their values before any of them starts changing them. But as I said, both versions work, and their run times are also pretty much the same.
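For comparison, here is a sketch of a third variant that uses the common strided layout, where thread ix adds v[ix + h] into v[ix]. In each step a thread writes only the slot it also reads itself, and the upper half is only read, so reads and writes of different threads never overlap and one barrier per step suffices. The names sum_strided and block_sums_strided, the typedef of real as float, and the missing error checking are all assumptions made for this sketch; it also assumes blockDim.x is a power of two and the vector length is a multiple of it.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

typedef float real;  // assumption: real is float in this sketch

// In step h, thread ix (ix < h) reads v[ix] and v[ix + h] and writes v[ix].
// No other thread touches v[ix] during that step.
__global__ void sum_strided(real *v) {
    long int ix    = threadIdx.x;
    long int shift = (long int)blockIdx.x * blockDim.x;

    for (long int h = blockDim.x / 2; h >= 1; h /= 2) {
        if (ix < h) {
            v[ix + shift] += v[ix + h + shift];
        }
        __syncthreads();  // finish step h before any thread starts step h/2
    }
}

// Hypothetical host helper: reduces each block on the device and returns
// one partial sum per block. CUDA error checking is omitted for brevity.
std::vector<real> block_sums_strided(const std::vector<real>& host, int threads) {
    const size_t n = host.size();
    assert(threads > 0 && n % threads == 0);
    const int blocks = static_cast<int>(n / threads);

    real *d_v = nullptr;
    cudaMalloc(&d_v, n * sizeof(real));
    cudaMemcpy(d_v, host.data(), n * sizeof(real), cudaMemcpyHostToDevice);

    sum_strided<<<blocks, threads>>>(d_v);

    // This blocking copy also waits for the kernel to finish.
    std::vector<real> out(n);
    cudaMemcpy(out.data(), d_v, n * sizeof(real), cudaMemcpyDeviceToHost);
    cudaFree(d_v);

    // The partial sum of block b ends up at the start of its part of v.
    std::vector<real> partial(blocks);
    for (int b = 0; b < blocks; ++b) {
        partial[b] = out[static_cast<size_t>(b) * threads];
    }
    return partial;
}
```

Calling block_sums_strided on 1024 ones with 256 threads per block should return four partial sums of 256 each, which the CPU then adds up as before.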

Why is this?

I know that the GPU does not guarantee that what one thread writes to global memory is immediately visible to other threads. But it does not guarantee that the write stays invisible to them either.

Any ideas? I am working on a GTX 1080.

Cross-posted at:

http://stackoverflow.com/questions/41525245/syncthread-and-global-memory