I am trying to write a PDE solver in CUDA. The problem is that my program runs correctly in emulation mode, but on the real device it gives wrong results most of the time and correct ones only sometimes. More surprisingly, code that is equivalent in C gives different results in CUDA, for example:
if ((i == 1 || i == w - 2) && j > 0 && j < h-1)
g_U1[g_pos] = value;
I cannot find any difference between the two, but the results on the real device differ. Sometimes the first one gives the correct result, but when I change the size the second one gives the correct result. Both give the same correct result in emulation mode.
The pseudocode of my kernel, based on the SOR method, is as follows:
read data from global memory to shared mem
__syncthreads() // to ensure all data ready for next step
update the red point on the shared mem
write the red point to global mem
__syncthreads() // to ensure all data ready for next step
update the black point to the shared mem
write the black point to the global
__syncthreads()
There is no write conflict within a thread block in my code: each thread of a block writes to a different memory position. Red points are computed from black points only, and vice versa.
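To make the outline above concrete, here is a rough sketch of what such a kernel could look like. It is not the actual code: the tile size TILE, the halo layout, the Laplace five-point stencil, and the names sorTileStep, g_U and omega are all assumptions chosen for illustration.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // hypothetical tile width; launch with TILE x TILE threads per block

// Hypothetical sketch of the outline: one red-black SOR step for Laplace's
// equation, with the tile and a one-cell halo staged in shared memory.
__global__ void sorTileStep(float *g_U, int w, int h, float omega)
{
    __shared__ float s[TILE + 2][TILE + 2];
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;   // position inside the padded tile
    int i = blockIdx.x * TILE + threadIdx.x;          // global column
    int j = blockIdx.y * TILE + threadIdx.y;          // global row
    bool inGrid = (i < w && j < h);
    int g_pos = j * w + i;

    // Read data from global memory to shared memory. The halo cells belong
    // to neighbouring blocks and are loaded only once, here.
    if (inGrid) {
        s[ty][tx] = g_U[g_pos];
        if (threadIdx.x == 0 && i > 0)            s[ty][0]        = g_U[g_pos - 1];
        if (threadIdx.x == TILE - 1 && i < w - 1) s[ty][TILE + 1] = g_U[g_pos + 1];
        if (threadIdx.y == 0 && j > 0)            s[0][tx]        = g_U[g_pos - w];
        if (threadIdx.y == TILE - 1 && j < h - 1) s[TILE + 1][tx] = g_U[g_pos + w];
    }
    __syncthreads();  // all shared data of this block is ready

    bool interior = inGrid && i > 0 && i < w - 1 && j > 0 && j < h - 1;
    for (int color = 0; color < 2; ++color) {         // 0 = red, 1 = black
        if (interior && ((i + j) & 1) == color) {
            float gs = 0.25f * (s[ty][tx - 1] + s[ty][tx + 1] +
                                s[ty - 1][tx] + s[ty + 1][tx]);
            s[ty][tx] = (1.0f - omega) * s[ty][tx] + omega * gs;
            g_U[g_pos] = s[ty][tx];                   // write this color to global memory
        }
        __syncthreads();  // synchronizes the threads of this block only
    }
}
```

In a layout like this, every thread writes only its own cell, but points on the edge of a tile still read halo values that are owned and updated by a different block.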
__syncthreads() only synchronizes the threads within a block. It does not ensure that all the global memory writes made by other blocks have taken place. To get that guarantee, you need to let the kernel call finish and then call the kernel again. You likely have a race condition where "update the black point to the shared mem" reads old data from the global memory array. It may work for you in emulation because the latencies for memory writes are much lower there and the warp size is 1.
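As a minimal sketch of the "finish the kernel and call it again" approach, the red and black updates can be split into two launches of one kernel. Everything here is an assumption for illustration: the Laplace five-point stencil, the grid size, the boundary values, omega, and the names sorColorPass and u. Kernels launched into the same stream run in order, so the black pass starts only after every block of the red pass has finished writing.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Update only the points of one color (0 = red, 1 = black), in place.
// All four neighbours of a point have the opposite color, so no thread
// reads a value that is written during the same launch.
__global__ void sorColorPass(float *u, int w, int h, float omega, int color)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= w - 1 || j < 1 || j >= h - 1) return;  // boundary stays fixed
    if (((i + j) & 1) != color) return;                      // wrong color for this pass
    int pos = j * w + i;
    float gs = 0.25f * (u[pos - 1] + u[pos + 1] + u[pos - w] + u[pos + w]);
    u[pos] = (1.0f - omega) * u[pos] + omega * gs;
}

int main()
{
    const int w = 64, h = 64, iterations = 500;   // hypothetical problem size
    const float omega = 1.5f;                     // hypothetical relaxation factor
    const size_t bytes = size_t(w) * h * sizeof(float);

    std::vector<float> host(size_t(w) * h, 0.0f);
    for (int i = 0; i < w; ++i) host[i] = 1.0f;   // example boundary: top edge held at 1

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) { printf("cudaMalloc failed\n"); return 1; }
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    for (int it = 0; it < iterations; ++it) {
        sorColorPass<<<grid, block>>>(dev, w, h, omega, 0);  // red pass
        sorColorPass<<<grid, block>>>(dev, w, h, omega, 1);  // black pass sees all red writes
    }

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(err)); return 1; }
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("u at centre: %f\n", host[(h / 2) * w + w / 2]);
    return 0;
}
```

This version reads straight from global memory for simplicity. A shared-memory tile could be added inside each pass, as long as each launch handles only one color.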
In this case I do not compute the black points from the red points in global memory but from those in shared memory, and I think that once I call __syncthreads(), all writes to shared memory should be finished, so the data should be the new one.
Moreover, I compute only one iteration per kernel launch, that is, I let the kernel finish and then launch it again, so the values in global memory for the next iteration should be the new ones.