GPU synchronization threadfence()

I tried to implement the GPU synchronization method introduced in "On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit" (http://synergy.cs.vt.edu/pubs/papers/xiao-icpads2009-gpu.pdf). The method is very similar to the code sample on p. 111 of the CUDA Programming Guide Version 2.3.1.

I used the synchronization function given in Figure 7 of the paper in my kernel, and it worked correctly when the dimension of the matrix was smaller than 256*256. However, when the dimension of the matrix is 256*256 or greater, the program never seems to terminate. So I wrote a very simple kernel to test the synchronization function, in which 1 is added to each element of the matrix in each iteration. It also fails once the dimension of the matrix is increased.

I have attached my code and hope someone could help. Thanks a lot.

BTW: GPU1.cu contains the main function. Device_MatrixUtilities.cu includes the synchronization function __device__ void __GPU_sync(int goalVal), the kernel __global__ void Test(float *U), and other related functions. The file header.h contains the definitions of the matrix, block, and grid dimensions.
header.h (413 Bytes)
Device_MatrixUtilities.cu (1.73 KB)
GPU1.cu (1.86 KB)

Your code has a race condition:

// GPU synchronization function (one dimension)
__device__ void __GPU_sync(int goalVal)
{
	// thread ID in a block
	int tid = threadIdx.x;
	// memory flush to all threads
	__threadfence();
	// only thread 0 is used for synchronization
	if (tid == 0)
	{
		atomicAdd((int *)&g_mutex, 1);
		// only when all blocks add 1 to g_mutex, will it be equal to goalVal
		while (g_mutex != goalVal)   // how to make sure g_mutex is what you want
		{
			// Do nothing
		}
	}
	__syncthreads();
}

Try:

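__device__ void __GPU_sync(int goalVal)
{
	// thread ID in a block
	int tid = threadIdx.x;
	// make this thread's earlier global-memory writes visible to the other blocks
	__threadfence();
	// only thread 0 is used for synchronization
	if (tid == 0)
	{
		atomicAdd((int *)&g_mutex, 1);
		// Poll with atomicAdd(..., 0) so every iteration re-reads the counter
		// from global memory. The value returned by the first atomicAdd never
		// changes, so looping on it would either exit at once or spin forever.
		// Compare with "<" rather than "!=": a fast block may already have
		// bumped the counter for the next barrier before this block looks.
		// This assumes g_mutex only ever increases between resets.
		while (atomicAdd((int *)&g_mutex, 0) < goalVal)
		{
			// Do nothing
		}
	}
	__syncthreads();
}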
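Two things commonly make this kind of barrier hang as the grid grows, so here is a rough, self-contained sketch that guards against both. First, a block that spins in the barrier never gives up its multiprocessor, so if the grid has more blocks than the GPU can keep resident at once, the remaining blocks never start and the counter never reaches the goal. Second, with a `!=` test a slow block can miss the moment the counter equals the goal. The sketch caps the grid at the resident-block limit reported by the occupancy API, uses a grid-stride loop so a smaller grid still covers the whole matrix, and polls with `<`. The names `gpu_sync`, `increment_test`, `run_barrier_test` and `THREADS_PER_BLOCK`, as well as the ping-pong layout of the test, are my own placeholders and not taken from the attached files. The extra `__syncthreads()` before the atomic is also my addition, so that thread 0 does not announce arrival before the rest of its block has finished writing.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 128          // hypothetical block size

__device__ int g_mutex = 0;            // barrier counter, only ever incremented

// Inter-block barrier: goalVal = (number of blocks) * (barriers passed so far + 1).
// Only safe when every block of the grid is resident on the GPU at the same time.
__device__ void gpu_sync(int goalVal)
{
    __threadfence();                   // publish this thread's global writes
    __syncthreads();                   // the whole block has arrived
    if (threadIdx.x == 0) {
        atomicAdd(&g_mutex, 1);
        volatile int *counter = &g_mutex;   // volatile: re-read memory on each poll
        while (*counter < goalVal) { }      // "<" tolerates blocks that run ahead
    }
    __syncthreads();                   // release the other threads of the block
}

// Each iteration writes dst[i] = src[(i + 1) % n] + 1, so every element depends
// on data that may belong to another block. The buffers are volatile so loads
// are not served from a stale per-multiprocessor cache.
__global__ void increment_test(volatile float *a, volatile float *b,
                               int n, int iterations)
{
    int stride = gridDim.x * blockDim.x;
    for (int it = 0; it < iterations; ++it) {
        volatile float *src = (it % 2 == 0) ? a : b;
        volatile float *dst = (it % 2 == 0) ? b : a;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            dst[i] = src[(i + 1) % n] + 1.0f;
        gpu_sync((it + 1) * gridDim.x);
    }
}

// Returns the number of wrong elements, or -1 on a CUDA error.
int run_barrier_test(int n, int iterations)
{
    cudaDeviceProp prop;
    int blocksPerSM = 0;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, increment_test, THREADS_PER_BLOCK, 0) != cudaSuccess)
        return -1;

    // Never launch more blocks than can be resident at once.
    int maxResident = blocksPerSM * prop.multiProcessorCount;
    int wanted = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    int grid = wanted < maxResident ? wanted : maxResident;
    if (grid < 1) return -1;

    float *a = nullptr, *b = nullptr;
    if (cudaMalloc(&a, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&b, n * sizeof(float)) != cudaSuccess) { cudaFree(a); return -1; }
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    int zero = 0;                      // reset the counter before every launch
    cudaMemcpyToSymbol(g_mutex, &zero, sizeof(int));

    increment_test<<<grid, THREADS_PER_BLOCK>>>(a, b, n, iterations);

    // The last iteration wrote b if the iteration count is odd, a if it is even.
    std::vector<float> host(n);
    const float *result = (iterations % 2 == 0) ? a : b;
    cudaError_t err = cudaMemcpy(host.data(), result, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != static_cast<float>(iterations)) ++wrong;
    return wrong;
}
```

Calling `run_barrier_test(256 * 256, 100)` from `main` should return 0 if the barrier holds. Note that an ordinary launch does not strictly guarantee that all blocks are co-resident (for example when another kernel is running on the same device), so treat this as a sketch under that assumption.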

Thanks a lot for your great help!