CUDA BUG? Shared memory contents differ across threads, __syncthreads() not working???

Hi, I have a piece of code where each thread writes to its own shared memory location. Afterwards I perform a __syncthreads(); and then read shared memory (also from the neighbours). Unfortunately, one thread reads different values than another thread, even though the element read should be the same. The code follows:

__global__ void facet_kernel(unsigned int blocks_offset, int2 *g_blocklist) {

  __shared__ int s_block_x, s_block_y;
  __shared__ unsigned int s_facet_hit[16 * 16];

  unsigned int tid = threadIdx.x + threadIdx.y * 16;

  if (tid == 0) {
    int2 number = g_blocklist[blockIdx.x + blocks_offset];
    s_block_x = number.x;
    s_block_y = number.y;
  }

//  s_facet_hit[tid] = 0; // normally we do not hit a facet
  s_facet_hit[tid] = tid; // for debugging purposes

  __syncthreads();

  unsigned int reflection_plane;
  bool found_reflection = false;

  // CODE TO FIND REFLECTIONS
  // SKIPPED HERE (sets found_reflection and reflection_plane)

  if (found_reflection) {
    s_facet_hit[tid] = reflection_plane + 1;
  }

  __syncthreads();

  // Write thread indices of a 3x3 block around the current thread
  // (rightmost component: some extra debugging output).
  // d_debug and d_debug_offset are device globals declared elsewhere.
  if ((s_block_x == -2) && (s_block_y == -44) && (tid == 136)) {

    unsigned int index = atomicInc(d_debug_offset, gridDim.x * 14 * 14);
    d_debug[index] = make_float4((float) (tid - 17),
                                 (float) (tid - 16),
                                 (float) (tid - 15),
                                 (float) s_facet_hit[tid + 16]);

    unsigned int index_1 = atomicInc(d_debug_offset, gridDim.x * 14 * 14);
    d_debug[index_1] = make_float4((float) (tid - 1),
                                   (float) tid,
                                   (float) (tid + 1),
                                   (float) s_facet_hit[tid + 16]);

    unsigned int index_2 = atomicInc(d_debug_offset, gridDim.x * 14 * 14);
    d_debug[index_2] = make_float4((float) (tid + 15),
                                   (float) (tid + 16),
                                   (float) (tid + 17),
                                   (float) s_facet_hit[tid + 16]);

    // Write the shared-memory values of a 3x3 block around the current thread
    // (rightmost component: some extra debugging output).
    if ((s_block_x == -2) && (s_block_y == -44) && (tid == 136)) { // same condition again

      unsigned int index = atomicInc(d_debug_offset, gridDim.x * 14 * 14);
      d_debug[index] = make_float4((float) s_facet_hit[tid - 17],
                                   (float) s_facet_hit[tid - 16],
                                   (float) s_facet_hit[tid - 15],
                                   (float) s_facet_hit[tid + 16]);

      unsigned int index_1 = atomicInc(d_debug_offset, gridDim.x * 14 * 14);
      d_debug[index_1] = make_float4((float) s_facet_hit[tid - 1],
                                     (float) s_facet_hit[tid],
                                     (float) s_facet_hit[tid + 1],
                                     (float) s_facet_hit[tid + 32]);

      unsigned int index_2 = atomicInc(d_debug_offset, gridDim.x * 14 * 14);
      d_debug[index_2] = make_float4((float) s_facet_hit[tid + 15],
                                     (float) s_facet_hit[tid + 16],
                                     (float) s_facet_hit[tid + 17],
                                     (float) s_facet_hit[tid + 33]);
    }
  }
}

Blocksize is 16x16, so tid has the values:

  0   1 ...  14  15
 16  17 ...  30  31
  .
  .
  .
240 241 ... 254 255

Output is

tid == 136

thread indices (4th column: extra debug output):
 135.0000  136.0000  137.0000    4.0000
 151.0000  152.0000  153.0000    4.0000
 167.0000  168.0000  169.0000    4.0000

shared-memory values:
 135.0000  136.0000  137.0000    4.0000
 151.0000    4.0000  153.0000    4.0000
 167.0000    4.0000    4.0000    4.0000

tid == 152

thread indices (4th column: extra debug output):
 119.0000  120.0000  121.0000  152.0000
 135.0000  136.0000  137.0000  152.0000
 151.0000  152.0000  153.0000  152.0000

shared-memory values:
 119.0000  120.0000  121.0000  152.0000
 135.0000  136.0000  137.0000  168.0000
 151.0000  152.0000  153.0000  169.0000

As you can see, two threads in the same block disagree about the same shared-memory element: for tid == 152, the value at index 152 still reads as its initial value (152), whereas for tid == 136 the same element reads as 4, i.e. it was overwritten (reflection_plane == 3).

Given the __syncthreads(); just before the code that extracts the debugging output, this looks to me like a CUDA bug.
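For reference, the suspect pattern boiled down to its essentials looks like this (a minimal sketch with made-up names, not my actual code; the modulo test stands in for the real reflection search):

```cuda
// Minimal repro sketch: each thread writes its own shared-memory slot,
// all threads barrier, some threads overwrite their slot, all threads
// barrier again, and then one thread reads its neighbours' slots.
// After the second __syncthreads(), every thread should see the same
// contents of s_hit[], which is exactly what fails in my kernel.
__global__ void repro_kernel(unsigned int *d_out)
{
    __shared__ unsigned int s_hit[16 * 16];
    unsigned int tid = threadIdx.x + threadIdx.y * 16;

    s_hit[tid] = tid;              // every thread writes its own element
    __syncthreads();

    if (tid % 3 == 0)              // stand-in for "found_reflection"
        s_hit[tid] = 4;
    __syncthreads();               // all threads must now agree on s_hit[]

    if (tid == 136) {              // one thread dumps its 3x3 neighbourhood
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                d_out[(dy + 1) * 3 + (dx + 1)] = s_hit[tid + dy * 16 + dx];
    }
}
```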

On XP64:

Driver 185.85

nvcc --version

nvcc: NVIDIA ® Cuda compiler driver

Copyright © 2005-2009 NVIDIA Corporation

Built on Sat_May__2_05:54:12_PDT_2009

Cuda compilation tools, release 2.2, V0.2.1221

This setup has the bug. Upgrading XP64 to the latest toolkit & driver fixes the problem, and Linux64 with the latest toolkit & driver also works correctly. Was there a bug with __syncthreads() fixed between 2.2 and 2.3???

P.S. I cannot change the topic-name unfortunately

I am running into the same problem again, now after putting a __syncthreads() into a device function. It looks as if __syncthreads(); statements in device functions are ignored, which in my case would mean a LOT of code duplication.
Can anybody from NVIDIA confirm whether __syncthreads(); should only be used in global functions? The programming guide does not mention that, but I only see examples where __syncthreads() is used in global functions.
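To make the question concrete, the shape of what I am doing is roughly this (a stripped-down sketch with made-up names, not my actual code):

```cuda
// Barrier inside a __device__ function: every thread of the block calls
// this helper unconditionally, so all threads reach both __syncthreads().
__device__ void shift_left(unsigned int *s_data, unsigned int tid)
{
    s_data[tid] += 1;
    __syncthreads();               // the barrier in question
    unsigned int neighbour = (tid < 255) ? s_data[tid + 1] : s_data[tid];
    __syncthreads();               // make sure all reads finish before writing
    s_data[tid] = neighbour;
}

__global__ void caller_kernel(unsigned int *d_out)
{
    __shared__ unsigned int s_data[256];
    unsigned int tid = threadIdx.x;

    s_data[tid] = tid;
    __syncthreads();
    shift_left(s_data, tid);       // called outside any divergent branch
    __syncthreads();
    d_out[blockIdx.x * 256 + tid] = s_data[tid];
}
```

If the barriers inside the device function were silently ignored, the read of the neighbour's slot could see a stale value, which matches what I observe.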

Grtz,
Dennis