Different result between emulation and real intractable bug

Linh_Ha · December 13, 2007, 7:20am

I am trying to write a PDE solver in CUDA. The problem is that my program runs correctly in emulation mode, but on the real device it gives wrong results, though sometimes the right ones. More surprisingly, two pieces of code that are equivalent in C give different results in CUDA, like:

// version 1
if ((i == 1 || i == w - 2) && j > 0 && j < h - 1)

        g_U1[g_pos] = value;

// version 2
if ((i == 1) && j > 0 && j < h - 1)

        g_U1[g_pos] = value;

if ((i == w - 2) && j > 0 && j < h - 1)

        g_U1[g_pos] = value;

I cannot find any difference between these two, but the results on the real device differ. Sometimes the first one gives the correct result, but when I change the size, the second one gives the correct result. Both give the same correct result in emulation mode.

The pseudocode of my kernel, based on the SOR method, is as follows:

read data from global memory to shared mem

__syncthreads() // to ensure all data ready for next step

update the red point on the shared mem

write the red point to global mem

__syncthreads() // to ensure all data ready for next step

update the black point to the shared mem 

write the black point to the global

__syncthreads()

There is no write conflict within a thread block in my code: each thread of a block writes to a different memory position. Red points are computed from the black points only, and vice versa.

Does anyone have experience with how to solve this problem?

Thank you

MisterAnderson42 · December 13, 2007, 4:28pm

__syncthreads() only synchronizes the threads within a block. It does not ensure that the global memory writes you have made are visible to threads in other blocks. To get that, you need to let the kernel call finish and call the kernel again. You likely have a race condition where "update the black point to the shared mem" reads old data from the global memory array. It may work for you in emulation because the latencies for memory writes are much lower and the warp size is 1.
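
As a rough illustration of the "let the kernel finish and call it again" advice, here is a minimal sketch. It assumes a 5-point Laplace stencil on a row-major w x h grid with fixed boundary values and a relaxation factor omega; none of that is confirmed by the original code. The names sor_color and sor_solve are hypothetical, the shared-memory tile is left out for brevity, and cudaDeviceSynchronize() comes from the current runtime API rather than the 2007 one.

```cuda
#include <cuda_runtime.h>

// Update every interior point of one color (0 = red, 1 = black) in place.
// A point of one color reads only neighbors of the other color, so no
// thread in this launch reads a value that another thread is writing.
__global__ void sor_color(float *u, int w, int h, float omega, int color)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row index

    if (i < 1 || i >= w - 1 || j < 1 || j >= h - 1) return;  // keep boundary fixed
    if (((i + j) & 1) != color) return;                      // wrong color

    int pos = j * w + i;
    float avg = 0.25f * (u[pos - 1] + u[pos + 1] + u[pos - w] + u[pos + w]);
    u[pos] += omega * (avg - u[pos]);  // assumed SOR update for Laplace's equation
}

// Host driver (hypothetical): one launch per color per iteration.
// Launches on the same stream run in order, so the black launch sees
// every global-memory write made by the red launch, across all blocks.
void sor_solve(float *d_u, int w, int h, float omega, int iterations)
{
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

    for (int it = 0; it < iterations; ++it) {
        sor_color<<<grid, block>>>(d_u, w, h, omega, 0);  // red points
        sor_color<<<grid, block>>>(d_u, w, h, omega, 1);  // black points
    }
    cudaDeviceSynchronize();  // wait for the last launch before the host reads d_u
}
```

Splitting the colors across launches trades some launch overhead for a grid-wide ordering guarantee that __syncthreads() inside a single kernel cannot give.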

Linh_Ha · December 13, 2007, 5:13pm

In this case I don't compute the black points from the red points in global memory but from the ones in shared memory, and I think that once I call __syncthreads(), all writes to shared memory should be finished, so the data should be the new values.

Moreover, I only compute one iteration per kernel call, meaning the kernel finishes and is then restarted, so the values in global memory for the next iteration should be the new ones.
