Problems with __syncthreads()

Hi there.

I have a complex kernel that calls __syncthreads() several times with a block of 64 threads (2 warps), but sometimes the barrier does not seem to work. Stranger still, every thread reaches the barrier except one (thread 0), which I thought was impossible because all threads of a warp should execute the same instruction.

I have tried to simplify the code and reproduce the error, but whenever I delete any part of the code it runs correctly. So I am posting only part of the code here. Note that this part does not fail when it runs on its own (in my real code it is inside loops and is called from a __global__ function):

The reduction array is in shared memory.
result is a float in shared memory.
signalsA, signalsB, dataA and dataB are in global memory.
S1, S2, t1 and t2 are also in shared memory.
The other arguments are passed by value.

__device__ inline void func(int id,float *result,double *signalsA,double *signalsB,const double *dataA,const double *dataB,double *reduction,float S1,float* S2,bool con,float t1,float* t2,int m1,int m2,int threadsBlock,int n1,int n2,int HR){

	double a;
	double b;
	double c;
	float H= 1.0f/(float)HR;

	float* count = result;

	if (!con){
		reduction[id]=0;
		for(int i=0; i<m1; i++){
			a=0;
			int dir;
			for (int m=0; m<HR; m++){
				dir= m*n1 +id+i*threadsBlock;
				a= a+(H*signalsA[dir]/S2[m]);
			}
	
			dir=id+i*threadsBlock;
			a=S1*a;
			a=dataA[dir]-a;
			reduction[id]+=a*a;
		}
		
		__syncthreads();
		for(unsigned int s=threadsBlock>>1; s>0; s>>=1) {
        		if (id<s){
            			reduction[id] = reduction[id] + reduction[id+s];
       	 		}
			__syncthreads();
		}
	
		if(id==0){
    			b=0.5*n1*log(0.5*reduction[0]); 
			c=0;
		}

		*count=0;
		__syncthreads();
		atomicAdd(count, 1);
		__syncthreads();
		printf("CHECK: %i...%.20f\n",id,*count);

		for (int m=0; m<HR; m++){
			reduction[id]=0;
			for(int i=0; i<m2; i++){
				int dir=m*n2+id+i*threadsBlock;
				a=dataB[dir]-signalsB[dir];
				reduction[id]+=a*a;
			}
		
			__syncthreads();
			for(unsigned int s=threadsBlock>>1; s>0; s>>=1) {
				
        			if (id<s){
            				reduction[id] = reduction[id] + reduction[id+s];
       	 			}
        			__syncthreads();
			}	
		
			c+=log(0.5*reduction[0]);
		}
		if(id==0){
			c*=0.5*n2; 
    			*result=b+c; 
		}	
	}else{
		reduction[id]=0;
		for(int i=0; i<m1; i++){
			a=0;
			int dir;
			for (int m=0; m<HR; m++){
				dir=m*n1+id+i*threadsBlock;
				a= a+(H*signalsA[dir]/S2[m]);
			}

			dir=id+i*threadsBlock;
			a=S1*a;
			reduction[id]+=log(dataA[dir])-0.5*t1*(dataA[dir]*dataA[dir]+a*a)+fun2(t1*dataA[dir]*a);
		}
		
		__syncthreads();
		for(unsigned int s=threadsBlock>>1; s>0; s>>=1) {
        		if (id<s){
            			reduction[id] = reduction[id] + reduction[id+s];
       	 		}
			__syncthreads();
		}
	
    		if(id==0){
    			b=-n1*log(t1);
			b-=reduction[0]; 
			c=0;
		}
		   
		for (int m=0; m<HR; m++){
			reduction[id]=0;
			for(int i=0; i<m2; i++){
				int dir=m*n2+id+i*threadsBlock;
				a = dataB[dir]*dataB[dir]+signalsB[dir]*signalsB[dir];
				reduction[id]+=log(dataB[dir])-0.5*t2[m]*a+fun2(t2[m]*dataB[dir]*signalsB[dir]);
			}
			__syncthreads();
			for(unsigned int s=threadsBlock>>1; s>0; s>>=1) {
        			if (id<s){
            				reduction[id] = reduction[id] + reduction[id+s];
       	 			}
        			__syncthreads();
			}	
			if(id==0){
				c+=n2*log(t2[m]);
				c+=reduction[0];
			}
		}
		if(id==0){
			*result=b-c; 
		}
	}
}

When I run it, sometimes all threads (64/64) reach all the barriers, but other times they do not. As you can see, I added a check in the middle of the code that counts the number of threads that reach the barrier. Each thread prints its id and this count. This is the output in some cases:

CHECK: 32...63.00000000000000000000
CHECK: 33...63.00000000000000000000
CHECK: 34...63.00000000000000000000
CHECK: 35...63.00000000000000000000
CHECK: 36...63.00000000000000000000
CHECK: 37...63.00000000000000000000
CHECK: 38...63.00000000000000000000
CHECK: 39...63.00000000000000000000
CHECK: 40...63.00000000000000000000
CHECK: 41...63.00000000000000000000
CHECK: 42...63.00000000000000000000
CHECK: 43...63.00000000000000000000
CHECK: 44...63.00000000000000000000
CHECK: 45...63.00000000000000000000
CHECK: 46...63.00000000000000000000
CHECK: 47...63.00000000000000000000
CHECK: 48...63.00000000000000000000
CHECK: 49...63.00000000000000000000
CHECK: 50...63.00000000000000000000
CHECK: 51...63.00000000000000000000
CHECK: 52...63.00000000000000000000
CHECK: 53...63.00000000000000000000
CHECK: 54...63.00000000000000000000
CHECK: 55...63.00000000000000000000
CHECK: 56...63.00000000000000000000
CHECK: 57...63.00000000000000000000
CHECK: 58...63.00000000000000000000
CHECK: 59...63.00000000000000000000
CHECK: 60...63.00000000000000000000
CHECK: 61...63.00000000000000000000
CHECK: 62...63.00000000000000000000
CHECK: 63...63.00000000000000000000
CHECK: 1...63.00000000000000000000
CHECK: 2...63.00000000000000000000
CHECK: 3...63.00000000000000000000
CHECK: 4...63.00000000000000000000
CHECK: 5...63.00000000000000000000
CHECK: 6...63.00000000000000000000
CHECK: 7...63.00000000000000000000
CHECK: 8...63.00000000000000000000
CHECK: 9...63.00000000000000000000
CHECK: 10...63.00000000000000000000
CHECK: 11...63.00000000000000000000
CHECK: 12...63.00000000000000000000
CHECK: 13...63.00000000000000000000
CHECK: 14...63.00000000000000000000
CHECK: 15...63.00000000000000000000
CHECK: 16...63.00000000000000000000
CHECK: 17...63.00000000000000000000
CHECK: 18...63.00000000000000000000
CHECK: 19...63.00000000000000000000
CHECK: 20...63.00000000000000000000
CHECK: 21...63.00000000000000000000
CHECK: 22...63.00000000000000000000
CHECK: 23...63.00000000000000000000
CHECK: 24...63.00000000000000000000
CHECK: 25...63.00000000000000000000
CHECK: 26...63.00000000000000000000
CHECK: 27...63.00000000000000000000
CHECK: 28...63.00000000000000000000
CHECK: 29...63.00000000000000000000
CHECK: 30...63.00000000000000000000
CHECK: 31...63.00000000000000000000

So while thread 0 is computing the log operation, the other threads do not wait for it. But at least the threads of the same warp should wait for it.

I do not understand what is happening.
Does anyone have an idea or an explanation?

Thank you

I have also checked that the con boolean is always false for all threads (so all of them execute the first branch), and that all threads reach the barriers before the checkpoint. So the problem occurs at the checkpoint itself.
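
To make the check easier to discuss, here is a minimal sketch of the same counting pattern in isolation. It is not my real code: the kernel name barrierCheck, the buffers in, bOut and seen, and the use of an int counter that is reset by a single thread (instead of reusing the float result, which every thread resets) are all assumptions made for illustration. If the barriers behave as documented, I would expect every thread to read a count equal to the block size.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical standalone kernel: counts how many threads arrive at a barrier
// while thread 0 alone does some extra work (a log) beforehand, as in func.
__global__ void barrierCheck(const double *in, double *bOut, int *seen)
{
    __shared__ int count;          // assumption: int counter, not the float result

    const int id = threadIdx.x;

    if (id == 0) {
        // Extra work done only by thread 0, mirroring the log in func.
        bOut[0] = 0.5 * log(0.5 * in[0]);
        count = 0;                 // a single thread resets the counter
    }
    __syncthreads();               // the reset is visible before any increment

    atomicAdd(&count, 1);          // every thread registers its arrival
    __syncthreads();               // all increments are done before any read

    seen[id] = count;              // record what each thread observed
}

int main()
{
    const int threads = 64;        // 2 warps, as in the original post
    double hIn = 4.0;              // arbitrary positive input for the log
    int hSeen[threads];

    double *dIn = nullptr, *dB = nullptr;
    int *dSeen = nullptr;
    cudaMalloc(&dIn, sizeof(double));
    cudaMalloc(&dB, sizeof(double));
    cudaMalloc(&dSeen, threads * sizeof(int));
    cudaMemcpy(dIn, &hIn, sizeof(double), cudaMemcpyHostToDevice);

    barrierCheck<<<1, threads>>>(dIn, dB, dSeen);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hSeen, dSeen, sizeof(hSeen), cudaMemcpyDeviceToHost);

    // Report every thread that observed something other than the block size.
    int bad = 0;
    for (int i = 0; i < threads; i++) {
        if (hSeen[i] != threads) {
            printf("thread %d saw %d\n", i, hSeen[i]);
            bad++;
        }
    }
    printf("threads that saw a count other than %d: %d\n", threads, bad);

    cudaFree(dIn);
    cudaFree(dB);
    cudaFree(dSeen);
    return bad != 0;
}
```

This can be built with nvcc (for example `nvcc barrier_check.cu -o barrier_check`, where the file name is just a placeholder). Writing the observed count to a buffer instead of calling printf in the kernel keeps the check separate from the printing. Running the full program under a shared-memory race checker such as `compute-sanitizer --tool racecheck` may also help narrow down where the counts start to differ.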