Why does a single warp need __syncthreads()?

Running on a GT 440 (Fermi).
I have 384 threads per block (dim3 threads(384, 1, 1)); I am batching my data into 8 groups of 48 values (to speed up other aspects of the routine). Essentially I am performing a matrix (48x48) times a vector (48x1). The code snippet WITHOUT the __syncthreads() calls does not deliver correct results; it requires each of the __syncthreads() calls to deliver correct results.
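A minimal sketch of the thread grouping this implies (row and lane are illustrative names, not from the actual code):

    int tx   = threadIdx.x;  // 0..383
    int row  = tx / 48;      // which of the 8 matrix rows this thread works on
    int lane = tx % 48;      // position within that row's dot product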

The code assumes that each group of 48, as defined by tx, is one matrix row times the column vector. I break up the additions into 3 blocks of 16 and add up the sums of each group of 16 at the end. Each grouping of 16 threads should be in the same warp – (0 to 15), (16 to 31), (32 to 47) – so I should not need to __syncthreads() for these entries, as each line of the code executes simultaneously in every thread within the warp, right?

__shared__ float facc[2048];
int tx = threadIdx.x;

facc[tx] = x0[tx] * Cij[ind];
__syncthreads();
facc[tx] += facc[tx + 1];   // 0+1; 2+3; ...
__syncthreads();
facc[tx] += facc[tx + 2];   // 0+2; 4+6; ...
__syncthreads();
facc[tx] += facc[tx + 4];   // 0+4; 8+12; ...
__syncthreads();
facc[tx] += facc[tx + 8];   // 0+8; ...
__syncthreads();            // wait here for all half warps to do this calc
if (tx % 48 == 0) {
    float answer = facc[tx] + facc[tx + 16] + facc[tx + 32];  // store result
}

As I noted above, without the sync calls the answer is wrong, but using the __syncthreads() calls slows the module down by a factor of 2! So I want to eliminate the sync calls.

Can someone explain why the sync is needed if the computation is all internal to the same warp?

thanks

JM

You need to declare facc as volatile if you drop the __syncthreads(), or the compiler won't reload the changed values from shared memory to registers.
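Something like this, roughly (a sketch only, assuming the rest of the kernel is unchanged; the last __syncthreads() stays because facc[tx + 32] is produced by a different warp, and note that this warp-synchronous style relies on Fermi-era lockstep execution):

    __shared__ volatile float facc[2048];  // volatile: re-read from shared memory each time
    int tx = threadIdx.x;

    facc[tx] = x0[tx] * Cij[ind];
    // No syncs needed below: each group of 16 lies within a single warp,
    // and volatile stops the compiler from caching facc in registers.
    facc[tx] += facc[tx + 1];
    facc[tx] += facc[tx + 2];
    facc[tx] += facc[tx + 4];
    facc[tx] += facc[tx + 8];
    __syncthreads();  // still needed: facc[tx + 32] comes from another warp
    if (tx % 48 == 0) {
        float answer = facc[tx] + facc[tx + 16] + facc[tx + 32];  // store result
    }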

Tera,

Super – many thanks!! volatile was one of those things I didn't get on the first reading of the manual.