running on GT 440 (Fermi).

I have 384 threads per block ( dim3 threads (384, 1, 1)); I am batching my data into 8 groups of 48 values (to speed up other aspects of the routine). Essentially performing a matrix (48X48) times a vector (48X1). The code snippet WITHOUT the __syncthreads does not deliver correct results; requires each of the __syncthread calls to deliver correct results.

The code assumes that each group of 48 as defined by tx is one matrix row times column vector. I break up the additions into 3 blocks of 16; add up the sums of each group of 16 at the end. Each grouping of 16 threads should be in the same warp – (0 to 15) (16 to 32) (33 so I should not need to __syncthread for these entries as each line of the code executes in each thread simultaneously within the warp, right?

```
__shared__ float facc [2048];
int tx = threadIdx.x;
facc[tx] = x0[tx] * Cij[ind];
__syncthreads();
facc [tx] += facc[tx + 1]; // 0+1; 2+3; ...
__syncthreads();
facc [tx] += facc[tx + 2]; // 0 + 2; 4 + 6; ...
__syncthreads();
facc [tx] += facc[tx + 4]; // 0 + 4; 8 + 12;
__syncthreads();
facc [tx] += facc[tx + 8]; // 0 + 8;
__syncthreads (); // wait here for all half warps to do this calc
if ( tx modulo 48 is zero ) then
float answer = facc[tx] + facc[tx + 16] + facc[tx + 32]; // store result
```

As i noted above, without the sync call the answer is wrong. but using the syncthreads calls is slowing the module by a factor of 2! so i want to eliminate the sync calls.

Can someone explain why the sync is needed if the computation is all internal to the same warp?

thanks

JM