I have the following program:

__global__ void incrementArrayOnDevice(float *a, int N)
{
    int i, y, z;
    volatile int x;
    y = 3; z = 10;
    for (i = 0; i < (10 - threadIdx.x) * 10000; i++)
        x = y + z;
    a[0] = threadIdx.x;
}
I launch the kernel as incrementArrayOnDevice<<<1,10>>>(d,N);

When I print h[0], the answer is always 0.0000.

I expected the answer to be 9.00000, but after many test runs the output is always 0.000.

I think thread 9 should finish last, since it does more loops.

What mainly confuses me is how the hardware handles synchronization.

In other words, how does the hardware know that some threads have to wait for other threads to finish their tasks?

In the program above, some threads do less work and some do more, yet the value of h[0] is still zero, which seems quite weird.

You can’t assume that the code will be compiled and run literally.
The compiler will optimize the global memory store out of the loop (it may even eliminate the loop altogether), so each thread’s store will happen only once, and the order of the stores will be whatever the architecture chooses. It is certainly not predictable a priori.

Even if the code did run as written, all threads within the warp would keep running until the loop has finished for every thread in the warp. The global memory store is then executed in parallel by all of them, and which thread’s value ends up in a[0] is undefined.
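To make the point concrete, here is a rough sketch (not the original poster's code) that contrasts the racy store with a version where exactly one thread writes. Note that in the posted loop bound, (10 - threadIdx.x)*10000, thread 0 gets the largest iteration count (100000) and thread 9 the smallest (10000), so the delay loop would not make thread 9 finish last even if timing decided the result. The names racyStore, singleWriterStore and runOnce are made up for this illustration, and the assert-based error checks are only there to keep it short.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Racy: every thread of the block stores to a[0].
// One store survives, but which one is unspecified.
__global__ void racyStore(float *a)
{
    a[0] = threadIdx.x;
}

// Deterministic: only the thread whose index equals `writer` stores.
__global__ void singleWriterStore(float *a, unsigned int writer)
{
    if (threadIdx.x == writer)
        a[0] = threadIdx.x;
}

// Hypothetical helper: runs one block of 10 threads and returns a[0].
// useSingleWriter selects between the two kernels above.
float runOnce(bool useSingleWriter, unsigned int writer)
{
    float h = -1.0f;               // sentinel, overwritten by the kernel
    float *d = nullptr;
    cudaError_t err = cudaMalloc(&d, sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMemcpy(d, &h, sizeof(float), cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);

    if (useSingleWriter)
        singleWriterStore<<<1, 10>>>(d, writer);
    else
        racyStore<<<1, 10>>>(d);

    // A blocking cudaMemcpy on the default stream waits for the kernel.
    err = cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d);
    return h;
}

int main()
{
    // The racy result is some value in 0..9; do not rely on which one.
    printf("racy store:    h[0] = %f\n", runOnce(false, 0));
    // With a single writer the result is the writer's index.
    printf("single writer: h[0] = %f\n", runOnce(true, 9));
    return 0;
}
```

If you really want the value 9 in h[0], guarding the store with the thread index like this is the simplest way to get it; no delay loop is needed.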


  1. All threads within a warp execute the same instruction (except when a branch diverges, in which case they should reconverge at the end of the branch).
  2. All warps within a block execute in parallel. A __syncthreads() call can be used to prevent warps from continuing until all warps in the block have reached that point.
  3. Blocks aren’t synchronised in a defined way during kernel execution.
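As an illustration of point 2, here is a minimal sketch of __syncthreads() used as a block-wide barrier. It assumes a single block of at most 32 threads; the names lastThreadWins, runLastThreadWins and MAX_THREADS are invented for this example. Each thread stages its value in shared memory, everyone waits at the barrier, and then one thread picks the value it wants and writes it out.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_THREADS 32   // assumption: one block, at most 32 threads

__global__ void lastThreadWins(float *a)
{
    __shared__ float staged[MAX_THREADS];

    // Each thread publishes its own index to its own shared-memory slot.
    staged[threadIdx.x] = threadIdx.x;

    // Barrier: no thread continues until all threads in the block arrive,
    // so every slot is filled before anyone reads it.
    __syncthreads();

    // A single thread reads the slot of the highest-numbered thread
    // and performs the only global store.
    if (threadIdx.x == 0)
        a[0] = staged[blockDim.x - 1];
}

// Hypothetical helper: launches one block of `threads` threads, returns a[0].
float runLastThreadWins(int threads)
{
    assert(threads >= 1 && threads <= MAX_THREADS);
    float h = -1.0f;               // sentinel, overwritten by the kernel
    float *d = nullptr;
    cudaError_t err = cudaMalloc(&d, sizeof(float));
    assert(err == cudaSuccess);

    lastThreadWins<<<1, threads>>>(d);

    // A blocking cudaMemcpy on the default stream waits for the kernel.
    err = cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d);
    return h;
}

int main()
{
    printf("h[0] = %f\n", runLastThreadWins(10));
    return 0;
}
```

The barrier only covers threads in the same block. Per point 3, there is no equivalent call inside a kernel that orders one block against another.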