Different results between emulation and real mode: an intractable bug

I am writing a PDE solver in CUDA. My program runs correctly in emulation mode, but in real (device) mode it gives wrong results, although sometimes the results are right. More surprisingly, two pieces of code that are equivalent in C give different results in CUDA, for example:

if ((i == 1 || i == w - 2) && j > 0 && j < h - 1)
        g_U1[g_pos] = value;

and

if ((i == 1) && j > 0 && j < h - 1)
        g_U1[g_pos] = value;

if ((i == w - 2) && j > 0 && j < h - 1)
        g_U1[g_pos] = value;

I cannot see any difference between these two versions, but the results on the device differ. Sometimes the first version gives the correct result, but when I change the grid size the second one is correct instead. Both give the same correct result in emulation mode.

The pseudocode of my kernel, based on the SOR method, is as follows:

read data from global memory into shared memory

__syncthreads() // to ensure all data ready for next step

update the red points in shared memory

write the red points to global memory

__syncthreads() // to ensure all data ready for next step

update the black points in shared memory

write the black points to global memory

__syncthreads()

There is no write conflict within a thread block in my code: each thread in a block writes to a different memory position. Red points are computed from black points only, and vice versa.
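The red/black property described above can be checked with a small sketch. It assumes the usual checkerboard colouring by the parity of i + j; the names color_of and neighbours_have_other_color are hypothetical, and the original code may colour the points differently.

```cuda
#include <cassert>

// Hypothetical colouring (an assumption, not taken from the original code):
// 0 = red, 1 = black, decided by the parity of i + j.
__host__ __device__ inline int color_of(int i, int j)
{
    return (i + j) & 1;
}

// Returns true if every interior point of a w x h grid has four
// neighbours (left, right, up, down) of the opposite colour.
inline bool neighbours_have_other_color(int w, int h)
{
    assert(w >= 3 && h >= 3);
    for (int j = 1; j < h - 1; ++j) {
        for (int i = 1; i < w - 1; ++i) {
            int c = color_of(i, j);
            if (color_of(i - 1, j) == c || color_of(i + 1, j) == c ||
                color_of(i, j - 1) == c || color_of(i, j + 1) == c)
                return false;
        }
    }
    return true;
}
```

With this colouring, a thread updating a red point only reads black points, so threads updating the same colour never read a value that another thread is writing in the same half-sweep.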

Does anyone have experience with how to solve this kind of problem?

Thank you.

__syncthreads() only synchronizes the threads within a block. It does not ensure that all the global memory writes you have made have completed and are visible to other blocks. To get that guarantee, you need to let the kernel call finish and then call the kernel again. You likely have a race condition where the "update the black point" step reads old data from the global memory array. It may work for you in emulation because the latencies for memory writes are much lower and the warp size is 1.
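A minimal sketch of the "let the kernel finish and call it again" approach, with one launch per colour. This is not the original poster's code: it assumes Laplace's equation with fixed boundary values, a checkerboard colouring by the parity of i + j, and works directly on global memory. The names sor_half_sweep and solve_laplace_sor are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical half-sweep: updates only interior points of one colour
// (0 = red, 1 = black). Assumes Laplace's equation on a w x h grid; the
// original post does not say which PDE is being solved.
__global__ void sor_half_sweep(float* u, int w, int h, float omega, int color)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i > w - 2 || j < 1 || j > h - 2) return; // boundary stays fixed
    if (((i + j) & 1) != color) return;                   // other colour: skip
    int p = j * w + i;
    // All four neighbours have the other colour, so no thread in this
    // launch writes them.
    float avg = 0.25f * (u[p - 1] + u[p + 1] + u[p - w] + u[p + w]);
    u[p] += omega * (avg - u[p]);
}

// Hypothetical host driver: boundary = 1, interior = 0; returns the grid
// after `iterations` full red/black sweeps.
std::vector<float> solve_laplace_sor(int w, int h, float omega, int iterations)
{
    assert(w >= 3 && h >= 3);
    std::vector<float> u(w * h, 0.0f);
    for (int j = 0; j < h; ++j)
        for (int i = 0; i < w; ++i)
            if (i == 0 || i == w - 1 || j == 0 || j == h - 1)
                u[j * w + i] = 1.0f;

    float* d_u = nullptr;
    size_t bytes = u.size() * sizeof(float);
    cudaMalloc(&d_u, bytes);
    cudaMemcpy(d_u, u.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    for (int it = 0; it < iterations; ++it) {
        // Launches on the same stream run in order: the black sweep starts
        // only after every block of the red sweep has finished.
        sor_half_sweep<<<grid, block>>>(d_u, w, h, omega, 0);
        sor_half_sweep<<<grid, block>>>(d_u, w, h, omega, 1);
    }
    // This blocking copy also waits for the kernels queued before it.
    cudaMemcpy(u.data(), d_u, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_u);
    return u;
}
```

For example, solve_laplace_sor(16, 16, 1.5f, 500) should return a grid whose interior has relaxed towards the boundary value 1. Splitting the sweep into two launches costs extra launch overhead, but it makes the ordering between the red and black updates explicit across all blocks, not just within one block.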

In this case I do not compute the black points from the red points in global memory, but from the ones in shared memory, and I think that once I call __syncthreads(), all writes to shared memory should be finished, so the data should be up to date.

Moreover, each kernel call computes only one iteration: the kernel finishes and is then launched again, so the values in global memory for the next iteration should be the new ones.