warp synchronization test

I wrote a simple program to test whether the threads in the same warp are synchronized automatically.

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <stdlib.h>

#define in 32
#define dimBx 32
#define dimGx ((in + dimBx -1)/dimBx)

__device__ double d_w[in];

__global__ void G_te_x()
{
  int id,idx;
  int i = blockIdx.x*blockDim.x + threadIdx.x - blockIdx.x;
  __shared__ double s_flux[dimBx];  
  __syncthreads();
  id = i;
  idx = threadIdx.x;
  s_flux[idx] = 1.0;
  d_w[id] = 0.3;
  if (idx<blockDim.x-1 && i<in-1){
    d_w[id] = 7.0 + s_flux[idx+1];
  }
}
int main()
{
  int i;
  double h_w[32];
  dim3 dim_G(dimGx);
  dim3 dim_B(dimBx);
  G_te_x<<<dim_G, dim_B>>>();
  cudaDeviceSynchronize();
  cudaMemcpyFromSymbol(h_w, d_w, in*sizeof(double));
  for(i=0;i<in;i++){
    printf("a=%.15f,%d\n",h_w[i],i);
  }
  return 0;
}

All 32 threads are in the same warp. I expected the result to be h_w[0..30] = 8.0 and h_w[31] = 0.3, but the actual result is h_w[0..30] = 7.0 and h_w[31] = 0.3.
That suggests the threads are not synchronized. Is there a problem in my program?

I apologize to the person who answered my question; I pressed the wrong option by mistake.
I had omitted a __syncthreads(); statement from the posted code (just added it back). Although it seems to have no effect, even with that statement the result is still h_w[0..30] = 7.0.

Compile command: nvcc main_C.cu -lm -o ffe -arch compute_20 -code sm_20 --ptxas-options=-v
CUDA version: 5.0
GPU: NVIDIA C2050

Shared memory updates are only guaranteed to be visible to the thread that issued them. To make them visible to other threads in the block, a synchronization point of some sort is necessary. This is not actually a synchronization issue, but a compiler optimization issue. Specifically, this update:

s_flux[idx] = 1.0;

will not necessarily be visible to any other thread reading the value here:

d_w[id] = 7.0 + s_flux[idx+1];

unless a specific synchronization or barrier is established between those two points in the code.

The compiler is free to optimize shared memory values into registers if it so chooses. Those optimizations can be performed at any time. In order to force an update from another thread to be visible, you can do one of the following things:

  1. compile your code with -G, which disables most optimizations
  2. add the “volatile” keyword to the shared memory declaration, which forces the compiler to forgo this optimization in general (every reference to the shared variable then compiles to an actual shared memory access, rather than the use of a register copy.)
  3. add a __syncthreads() after the update of shared memory:

s_flux[idx] = 1.0;
__syncthreads();
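
Applied to the kernel from the question, option 3 is just a matter of barrier placement: the __syncthreads() at the top of the kernel separates nothing, while one placed between the shared store and the shared load does. A sketch, reusing the question's definitions of in, dimBx, and d_w:

```cuda
__global__ void G_te_x()
{
  int i = blockIdx.x*blockDim.x + threadIdx.x - blockIdx.x;
  int idx = threadIdx.x;
  __shared__ double s_flux[dimBx];

  s_flux[idx] = 1.0;
  __syncthreads();   // execution barrier AND memory barrier: the stores to
                     // s_flux above are now visible to all threads in the block

  d_w[i] = 0.3;
  if (idx < blockDim.x-1 && i < in-1){
    d_w[i] = 7.0 + s_flux[idx+1];   // now reliably reads 1.0, giving 8.0
  }
}
```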

Not only is __syncthreads() a thread barrier, but it also has the effect of forcing shared memory updates to be visible to all threads in the block. Note the programming guide description of __syncthreads():

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions

“…and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.”

So not only is it a thread execution barrier, but it is a memory barrier as well. (An explicit memory barrier function, such as __threadfence_block() would work here also.)
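
In this particular kernel, the fence-only variant would look like the fragment below. Note that __threadfence_block() is purely a memory fence, not an execution barrier, so this sketch relies on the producing threads having already executed their store, which holds here since all 32 threads are in one warp:

```cuda
s_flux[idx] = 1.0;
__threadfence_block();   // forces the store out to shared memory and makes it
                         // visible block-wide, but does NOT make threads wait
d_w[id] = 7.0 + s_flux[idx+1];
```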

  4. you could also have each thread update the shared memory value it will actually use later:

s_flux[(idx+1)%blockDim.x] = 1.0;

since that value is then guaranteed to be visible to that thread.

The addition or removal of the __syncthreads() that you have in your code does affect the issue, but only as a side-effect of the compiler optimization choices.

You can also spot this issue by running cuda-memcheck with the --tool racecheck option. When I do this on CUDA 6.5 it reports something like this:

========= Race reported between Write access at 0x00000090 in G_te_x(void)
========= and Read access at 0x000000a8 in G_te_x(void) [248 hazards]

========= RACECHECK SUMMARY: 1 hazard displayed (1 error, 0 warnings)

but if I add the __syncthreads() after the update to shared memory (3 above), or have each thread update the value it will use later (4 above), I get no warnings.

That’s an interesting and important point about the lack of intrawarp shared memory read-after-write guarantee being a compiler issue. In the past, the use of implicitly synchronized intrawarp shared memory accesses was advocated by NVIDIA as a significant speedup in many contexts. See for example Mark Harris’s optimization tutorials, where using the dependency on implied synchronization gives about a 2X speedup on a reduction kernel.

However a single paragraph in the Kepler Tuning Guide (of all places) now recommends against the paradigm.

I always regarded the Kepler Tuning Guide documentation warning as just a hint for future-proofing code for new as-yet-unreleased architectures. But txbob points out that the compiler (as of 6.0??) can now produce code which breaks that assumption… that’s a surprise to me even with the warning that quietly appeared last year. (Shouldn’t this be mentioned in the Programming Guide?)

I now need to revisit some of my code where I know I depend on intrawarp shared memory consistency. It may be worthwhile to even use the rare and advanced PTX level named synchronization barrier bar.sync to allow fine grained intrawarp barriers.

Perhaps you might be more specific about why you think Mark Harris’ tutorial that you linked is inconsistent. On slide 22 he says:

“IMPORTANT: For this to be correct, we must use the ‘volatile’ keyword!”

(this refers to unrolling the last warp, where the __syncthreads() construct is dispensed with. This is exactly where the issue described might arise, and the volatile qualifier addresses it. The earlier reduction steps in that slide deck were covered by the __syncthreads() memory barrier, so volatile was unnecessary up to that point.)
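
For reference, the last-warp unrolling in that deck follows this general pattern (a sketch from memory of the standard reduction example, not a verbatim copy of the slides; it assumes blockDim.x >= 64):

```cuda
// Final warp of a block reduction. No __syncthreads() is used; correctness
// depends on the volatile qualifier, which turns every access into a real
// shared memory transaction so the read-after-write chain between lanes holds.
__device__ void warpReduce(volatile double *sdata, int tid)
{
  sdata[tid] += sdata[tid + 32];
  sdata[tid] += sdata[tid + 16];
  sdata[tid] += sdata[tid +  8];
  sdata[tid] += sdata[tid +  4];
  sdata[tid] += sdata[tid +  2];
  sdata[tid] += sdata[tid +  1];
}
```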

Use of the volatile keyword is one of the approaches I mentioned to deal with this issue, and the programming guide has for some time had a description about it that I think is instructive:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#volatile-qualifier

"E.2.2.3. Volatile Qualifier

The compiler is free to optimize reads and writes to global or shared memory (for example, by caching global reads into registers or L1 cache) as long as it respects the memory ordering semantics of memory fence functions (Memory Fence Functions) and memory visibility semantics of synchronization functions (Synchronization Functions).

These optimizations can be disabled using the volatile keyword: If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction."
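
In the kernel from the question, option 2 is a one-line change to the declaration:

```cuda
volatile __shared__ double s_flux[dimBx];   // every read/write of s_flux now
                                            // compiles to an actual shared
                                            // memory instruction
```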

It’s true that the above description was re-worded slightly in CUDA 6.5, but the previous versions certainly carried much of the same, e.g. CUDA 5:

“D.2.2.2 Volatile Qualifier

Only after the execution of a __threadfence_block(), __threadfence(), or __syncthreads() (Memory Fence Functions and Synchronization Functions) are prior writes to global or shared memory guaranteed to be visible by other threads. As long as this requirement is met, the compiler is free to optimize reads and writes to global or shared memory.

This behavior can be changed using the volatile keyword: If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.

For example, in the code sample of Synchronization Instruction, if s_ptr were not declared as volatile, the compiler would optimize away the store to shared memory for each assignment to s_ptr[tid]. It would accumulate the result into a register instead and only store the final result to shared memory, which would be incorrect.”

I do not think anything has fundamentally changed in the compiler in this regard. As I have pointed out multiple times in the past, the CUDA compiler’s view of the world is essentially that of a single thread, since thread-block organization is not known at compile time. There are some exceptions to this, for example the compiler can sometimes establish that an access is uniform across all threads in a thread block, and this has been exploited in the past to generate uniform loads (LDU) for older GPU architectures.

This single-threaded world view very much applies to the determination of data dependencies in the code, which uses ordinary C/C++ semantics. So, for example, locations arr[tid+1], arr[tid+2], arr[tid+4] are considered independent of location arr[tid]. In the absence of known or possible (e.g. via potential pointer aliasing) data dependencies, the compiler often has significant room for re-ordering or eliminating loads and stores.
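
As a minimal illustration of that single-threaded world view (a hypothetical fragment, not from the code above):

```cuda
__global__ void k(double *out)
{
  __shared__ double s[64];
  int tid = threadIdx.x;
  s[tid] = 2.0 * tid;      // store made by this thread
  double v = s[tid + 1];   // store made by a DIFFERENT thread, but under
                           // ordinary C/C++ semantics s[tid] and s[tid+1] are
                           // simply independent locations, so without a barrier
                           // the compiler may reorder this load or satisfy
                           // other reads of s[tid] from a register copy
  out[tid] = v;
}
```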

What has changed in recent years is that the compiler does an increasingly better job of extracting performance by exploiting the freedom in load/store handling it has always had. This could, on occasion, lead to breaks in existing code that does not properly account for inter-thread data dependencies (whether the dependence is intra-warp makes no difference in this process). This may have included example code provided in older NVIDIA presentations or documentation. I would assume that any such instances have been corrected at this time.