Parallel Reduction release vs debug

Hi, I have a problem with this reduction kernel. When I compiled it without the -G flag I got weird results.
The kernel is launched with the sequence gmem = {0, …, 31}.
When I run the debug version all elements contain 31.
And when I run the non-debug version the result is:

16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
24 25 26 27 28 29 30 31 28 29 30 31 30 31 31 31
__shared__ float smem[smemSize];

	unsigned int tx = threadIdx.x;

	smem[tx] = gmem[idx];
	__syncthreads();

	if(smemSize>= 512) 
	{ if(tx < 256) { smem[tx] = max(smem[tx], smem[tx + 256]); } __syncthreads(); }
	if(smemSize>= 256) 
	{ if(tx < 128) { smem[tx] = max(smem[tx], smem[tx + 128]); } __syncthreads(); }
	if(smemSize>= 128) 
	{ if(tx <  64) { smem[tx] = max(smem[tx], smem[tx +  64]); } __syncthreads(); }

	if (tx < 32)
	{
		if(smemSize >= 64) smem[tx] = max(smem[tx], smem[tx + 32]);
		if(smemSize >= 32) smem[tx] = max(smem[tx], smem[tx + 16]);
		if(smemSize >= 16) smem[tx] = max(smem[tx], smem[tx +  8]);
		if(smemSize >=  8) smem[tx] = max(smem[tx], smem[tx +  4]);
		if(smemSize >=  4) smem[tx] = max(smem[tx], smem[tx +  2]);
		if(smemSize >=  2) smem[tx] = max(smem[tx], smem[tx +  1]);
	}

	gmem[tx] = smem[tx];
}

I have added __syncthreads(); after each line whitin the if block. Then the results are correct.
I don’t understand why is it required, since the threads within a warp are synchronous.
Thank you for your help.

declare smem as ‘volatile’ like this

shared volatile float smem[smemSize];

then try and report your results

The compiler knows nothing about the mapping of threads to data, as this is a function of the run-time launch configuration. As a consequence it has a single-threaded view of the world, except for those places where it can prove that certain operations are uniform across the entire thread block. So in particular, there is no notion of warp at compile time.

In the code at hand, data and control dependencies allow the compiler to optimize as follows for an optimized / release build (for example, to schedule loads as early as possible to help cover load latency):

if (tx < 32) {
  float t32 = smem[tx+32];
  float t16 = smem[tx+16];
  float t8 = smem[tx+8];
  float t4 = smem[tx+4];
  float t2 = smem[tx+2];
  float t1 = smem[tx+1];
  float t0 = smem[tx];
  if (smemSize >= 64) t0 = max(t0, t32);
  if (smemSize >= 32) t0 = max(t0, t16);
  if (smemSize >= 16) t0 = max(t0, t8);
  if (smemSize >=  8) t0 = max(t0, t4);
  if (smemSize >=  4) t0 = max(t0, t2);
  if (smemSize >=  2) t0 = max(t0, t1);
  smem[tx] = t0;
}

The use of __syncthreads() fixes the resulting functionality issue because the shared memory loads cannot be moved across this barrier.

that looks weird as well, it should look more like this;

best[threadIdx.x]=tnum;
	__syncthreads();

	//assuming 256 THREADS, change if different
	if(threadIdx.x<128){
		best[threadIdx.x] = (best[threadIdx.x+128] > best[threadIdx.x]) ? best[threadIdx.x+128] : best[threadIdx.x];
	}
	__syncthreads();
	if(threadIdx.x<64){
		best[threadIdx.x] = (best[threadIdx.x+64] > best[threadIdx.x]) ? best[threadIdx.x+64] : best[threadIdx.x];
	}
	__syncthreads();
	if(threadIdx.x<32){
		best[threadIdx.x] = best[threadIdx.x+32] > best[threadIdx.x] ? best[threadIdx.x+32] : best[threadIdx.x];
		best[threadIdx.x] = best[threadIdx.x+16] > best[threadIdx.x] ? best[threadIdx.x+16] : best[threadIdx.x];
		best[threadIdx.x] = best[threadIdx.x+8] > best[threadIdx.x] ? best[threadIdx.x+8] : best[threadIdx.x];
		best[threadIdx.x] = best[threadIdx.x+4] > best[threadIdx.x] ? best[threadIdx.x+4] : best[threadIdx.x];
		best[threadIdx.x] = best[threadIdx.x+2] > best[threadIdx.x] ? best[threadIdx.x+2] : best[threadIdx.x];
		best[threadIdx.x] = best[threadIdx.x+1] > best[threadIdx.x] ? best[threadIdx.x+1] : best[threadIdx.x];	
	}
	__syncthreads();

	if(threadIdx.x==0){
		ans_val[(permIdx>>8)]=best[0];
	}

I have used this parallel reduction document:
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
The only difference is that, I am using the max function instead of summing the two values.

Thank you, it worked. :-)