How to ensure all threads end except the 1st thread?

I have been trying to write an algorithm for calculating covariance. I want the 1st thread of the 1st block of the grid (i == 0) to finish by dividing the output by (size - 1). However, I can’t ensure that the last thread to finish is the thread where i == 0, which causes the result to differ from the actual result.

Any idea how to ensure that the last thread to finish is the thread where i == 0? Or is there another way to carry out the division?

Here is my kernel:

__global__ void cuCovariance(int2 *in, unsigned int size, int2 mean, int* out)
{
	unsigned int tid = threadIdx.x;
	unsigned int i = 16 * (blockIdx.x * blockDim.x + threadIdx.x);

	extern __shared__ int xy[];

	xy[tid] = 0;
	for (int j = 0; j < 16; j++)
		xy[tid] += (i < size) ? ((in[i+j].x - mean.x) * (in[i+j].y - mean.y)) : 0;
	__syncthreads();

	for (int s = blockDim.x / 2; s > 0; s /= 2)
	{
		if (tid < s)
			xy[tid] = xy[tid] + xy[tid + s];
		__syncthreads();
	}

	if (tid == 0)
		atomicAdd(out, xy[tid]);
}

I tried adding if(i==0) (*out)/=(size-1); after the atomicAdd().

But it seems that the thread with i == 0 performs the division before the rest of the blocks get to do their atomicAdd().

Maybe I’m missing something, but if you only have one division to do, why not do it on the CPU? End the kernel before the division is carried out, copy the raw “output” (rather than the end result) back to the CPU, then do the division there.
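For what it’s worth, the host-side finish might look something like this (just a sketch; blocks, threads, d_in and d_out are hypothetical host-side names, not from the post):

```cuda
// Sketch: let the kernel accumulate the raw sum, then finish on the CPU.
int h_out = 0;
cuCovariance<<<blocks, threads, threads * sizeof(int)>>>(d_in, size, mean, d_out);
cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost); // waits for the kernel
float cov = (float)h_out / (float)(size - 1); // the one division, on the host
```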

Hmm, that is what I thought of too, but then I would have to do a memcpy to the host first. I was just wondering whether I could perform that step on the device side; if not, I think I’ll have to make do with finishing the last step on the CPU…
My actual intention is to do all the steps in the kernel and time the kernel without the memcpy to and from the device (as the memcpy takes up a lot of time).
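One device-side option worth noting (a sketch, not something I have tested): kernels launched into the same stream run in order, so a trivial second kernel would only start after every block of cuCovariance has finished its atomicAdd(). cuDivide is a made-up name here:

```cuda
// Hypothetical follow-up kernel: the division stays on the device,
// and no memcpy is needed between the two launches.
__global__ void cuDivide(int *out, unsigned int size)
{
    *out /= (int)(size - 1); // integer division, matching the int accumulator
}

// Host side:
// cuCovariance<<<blocks, threads, smem>>>(d_in, size, mean, d_out);
// cuDivide<<<1, 1>>>(d_out, size); // runs only after cuCovariance completes
```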

__global__ void cuCovariance(int2 *in, unsigned int size, int2 mean, int *out)
{
    unsigned int tid = threadIdx.x;
    unsigned int i = 16 * (blockIdx.x * blockDim.x + threadIdx.x);

    extern __shared__ int xy[];
    xy[tid] = 0;
    for (int j = 0; j < 16; j++)
        xy[tid] += (i < size) ? ((in[i+j].x - mean.x) * (in[i+j].y - mean.y)) : 0;

    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            xy[tid] = xy[tid] + xy[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[tid] = xy[tid];
}

??

Put a __syncthreads() after the atomicAdd. That way, all threads will complete the atomicAdd function and THEN the first thread of the first block can enter the if statement to divide. Oh, and it needs to be ‘if(i == 0)’ and not ‘if(tid == 0)’. Using ‘if(tid == 0)’ will result in multiple threads (the first thread of EACH block) executing the division, and not just the one thread.

Tried __syncthreads() already. atomicAdd() only guarantees that the addition itself is carried out atomically; it doesn’t synchronize anything. And __syncthreads() only synchronizes the threads within a single block, not across the whole grid, so the first block can still reach the division before the other blocks have performed their atomicAdd(). I think that is the main reason why __syncthreads() doesn’t work properly in this case.
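If the division really has to happen inside the same kernel, the usual workaround is the pattern the threadFenceReduction sample in the CUDA SDK uses: after the atomicAdd, thread 0 of each block atomically increments a global counter, and whichever block finds it was the last to arrive does the division. A sketch (blocksDone is an extra variable I’m introducing, and it must be reset to 0 before each launch):

```cuda
__device__ unsigned int blocksDone = 0; // assumed reset to 0 before each launch

// ...replacing the end of the kernel:
if (tid == 0)
{
    atomicAdd(out, xy[0]);
    __threadfence(); // make this block's add visible to the whole grid
    unsigned int done = atomicInc(&blocksDone, gridDim.x); // returns the old value
    if (done == gridDim.x - 1) // this block was the last one to finish
        *out /= (int)(size - 1);
}
```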

For kyzhao:
wouldn’t

if(tid==0)
    out[tid] = xy[tid];

cause conflicts when writing to out[0]? Every block’s first thread has tid == 0, so they would all write to the same location.
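One common way around that (a sketch of my own, not from kyzhao’s post) is to give each block its own output slot and reduce those partial sums in a second step:

```cuda
// 'out' would need gridDim.x elements, one partial sum per block.
if (tid == 0)
    out[blockIdx.x] = xy[0];
// A second kernel (or the host) then sums out[0..gridDim.x-1]
// and divides by (size - 1).
```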