Hi,

I have one simple question, can I take advantage of the fact that threads in a warp execute simultaneosly and so discard __syncthreads() call for a parallel scan algorithm , like in my case , a sum of an array?. I won’t get a race condition or will I ? My hardware is GTX280, so it has warp size of 32.

This is the kernel example:

```
extern "C" __global__ void reduce_float_32_elts(float *dst_data,float *src_data,uint num_elts) {
__shared__ __device__ float tmp[32];
uint dst_idx,me;
dst_idx=blockIdx.x*(blockDim.x*blockDim.y)+threadIdx.y*blockDim.x;
me=dst_idx+threadIdx.x;
tmp[threadIdx.x]=0.0f;
if (me<num_elts) {
tmp[threadIdx.x]=src_data[me];
if (threadIdx.x<16) {
tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+16];
}
if (threadIdx.x<8) {
tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+8];
}
if (threadIdx.x<4) {
tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+4];
}
if (threadIdx.x<2) {
tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+2];
}
tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+1];
}
if (threadIdx.x==0) dst_data[dst_idx]=tmp[0];
}
```

It should be faster to call many kernels like this from the CPU until the sum is complete, than do a complex kernel with __syncthreads() or more "if"s to handle special conditions. am I right?

Any comment will be very appreciated

Regards