I am trying to implement a PDE solver in CUDA. The problem is that my program runs correctly in emulation mode, but when I run it in real (device) mode it gives wrong results, and only sometimes the right ones. More surprisingly, code that is equivalent in C gives different results in CUDA, for example:
First version:
    if ((i == 1 || i == w - 2) && j > 0 && j < h - 1)
        g_U1[g_pos] = value;
Second version:
    if ((i == 1) && j > 0 && j < h - 1)
        g_U1[g_pos] = value;
    if ((i == w - 2) && j > 0 && j < h - 1)
        g_U1[g_pos] = value;
I cannot find any difference between these two, but on the real device the results differ. Sometimes the first one gives the correct result, but when I change the size, the second one gives the correct result instead. Both give the same correct result in emulation mode.
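To rule out a plain logic difference between the two forms, here is a small host-side check in C. It is only a sketch: the function names are made up for this post, and it assumes i runs over 0..w-1 and j over 0..h-1. It compares the conditions only, not the kernel itself.

```c
#include <assert.h>

/* Single-condition form, as in the first snippet. */
static int cond_combined(int i, int j, int w, int h)
{
    return (i == 1 || i == w - 2) && j > 0 && j < h - 1;
}

/* Split form, as in the second snippet: either if triggers the write. */
static int cond_split(int i, int j, int w, int h)
{
    int hit = 0;
    if ((i == 1) && j > 0 && j < h - 1)
        hit = 1;
    if ((i == w - 2) && j > 0 && j < h - 1)
        hit = 1;
    return hit;
}

/* Counts the grid cells (i, j) on which the two forms disagree. */
static int count_mismatches(int w, int h)
{
    assert(w >= 3 && h >= 3);
    int mismatches = 0;
    for (int j = 0; j < h; ++j)
        for (int i = 0; i < w; ++i)
            if (cond_combined(i, j, w, h) != cond_split(i, j, w, h))
                ++mismatches;
    return mismatches;
}
```

If count_mismatches(w, h) returns 0 for the sizes I use, both forms select exactly the same cells, which suggests the difference on the device is not in the condition logic itself.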
The pseudocode of my code, which is based on the SOR method, is as follows:
    read data from global memory into shared memory
    __syncthreads() // to ensure all data is ready for the next step
    update the red points in shared memory
    write the red points to global memory
    __syncthreads() // to ensure all data is ready for the next step
    update the black points in shared memory
    write the black points to global memory
    __syncthreads()
There is no write conflict within a thread block in my code: each thread of a block writes to a different memory position. Red points are computed from black points only, and vice versa.
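For comparing against the device output, a CPU reference of one red-black SOR sweep could look like the sketch below. This is not my actual kernel: the 5-point Laplace stencil, the relaxation factor omega, the row-major w*h layout, and all function names are assumptions made for illustration.

```c
#include <math.h>
#include <stddef.h>

/* One colour pass of red-black SOR for the 5-point Laplace stencil
   (assumed stencil). u is a row-major w*h grid; boundary cells are
   left untouched. colour 0 updates cells with (i + j) even,
   colour 1 updates cells with (i + j) odd. */
static void sor_colour_pass(float *u, int w, int h, float omega, int colour)
{
    for (int j = 1; j < h - 1; ++j) {
        for (int i = 1; i < w - 1; ++i) {
            if (((i + j) & 1) != colour)
                continue;
            size_t p = (size_t)j * (size_t)w + (size_t)i;
            /* All four neighbours have the opposite colour. */
            float gs = 0.25f * (u[p - 1] + u[p + 1]
                              + u[p - (size_t)w] + u[p + (size_t)w]);
            u[p] += omega * (gs - u[p]);
        }
    }
}

/* One full sweep: first one colour, then the other. */
static void sor_sweep(float *u, int w, int h, float omega)
{
    sor_colour_pass(u, w, h, omega, 0);
    sor_colour_pass(u, w, h, omega, 1);
}

/* Largest absolute difference between two arrays of n floats,
   e.g. the CPU reference and the result copied back from the device. */
static float max_abs_diff(const float *a, const float *b, int n)
{
    float m = 0.0f;
    for (int k = 0; k < n; ++k) {
        float d = fabsf(a[k] - b[k]);
        if (d > m)
            m = d;
    }
    return m;
}
```

Running sor_sweep the same number of times as the kernel and then calling max_abs_diff on the two results gives one number to watch while changing the grid size or the block size.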
Does anyone have experience with how to solve this problem?
Thank you