Problems with __syncthreads()

moises1982 · May 4, 2013, 6:01pm

Hi there.

I have a complex kernel that calls __syncthreads() several times with a block of 64 threads (2 warps), but the barrier sometimes seems not to work. The most curious thing is that all the threads reach the barrier except one (thread 0), which I think should not be possible because all threads of a warp should execute the same instruction.

I have tried to simplify the code and reproduce the error, but when I delete any part of the code it runs correctly. So I am posting only part of the code here, although it doesn’t fail on its own (in my code it is inside loops and it is called from a __global__ function):

reduction array is in shared memory
result is a float in shared memory
signalsA, signalsB, dataA, dataB are in global memory
S1, S2, t1 and t2 are also in shared memory
and the other arguments are given by value

__device__ inline void func(int id,float *result,double *signalsA,double *signalsB,const double *dataA,const double *dataB,double *reduction,float S1,float* S2,bool con,float t1,float* t2,int m1,int m2,int threadsBlock,int n1,int n2,int HS){

	double a;
	double b;
	double c;
	float H= 1.0/(float)HS;

	float* count = result;

	if (!con){
		reduction[id]=0;
		for(int i=0; i<m1; i++){
			a=0;
			int dir;
			for (int m=0; m<HR; m++){
				dir= m*n1 +id+i*threadsBlock;
				a= a+(H*signalsA[dir]/S2[m]);
			}
	
			dir=id+i*threadsBlock;
			a=S1*a;
			a=dataA[dir]-a;
			reduction[id]+=a*a;
		}
		
		__syncthreads();
		for(unsigned int s=threadsBlock>>1; s>0; s>>=1) {
			if (id<s){
				reduction[id] = reduction[id] + reduction[id+s];
			}
			__syncthreads();
		}
	
		if(id==0){
			b=0.5*n1*log(0.5*reduction[0]);
			c=0;
		}

		*count=0;
		__syncthreads();
		atomicAdd(count, 1);
		__syncthreads();
		printf("CHECK: %i...%.20f\n",id,*count);

		for (int m=0; m<HR; m++){
			reduction[id]=0;
			for(int i=0; i<m2; i++){
				int dir=m*n2+id+i*threadsBlock;
				a=dataB[dir]-signalsB[dir];
				reduction[id]+=a*a;
			}
		
			__syncthreads();
			for(unsigned int s=threadsBlock>>1; s>0; s>>=1) {
				if (id<s){
					reduction[id] = reduction[id] + reduction[id+s];
				}
				__syncthreads();
			}
		
			c+=log(0.5*reduction[0]);
		}
		if(id==0){
			c*=0.5*n2;
			*result=b+c;
		}
	}else{
		reduction[id]=0;
		for(int i=0; i<m1; i++){
			a=0;
			int dir;
			for (int m=0; m<HR; m++){
				dir=m*n1+id+i*threadsBlock;
				a= a+(H*signalsA[dir]/S2[m]);
			}

			dir=id+i*threadsBlock;
			a=S1*a;
			reduction[id]+=log(dataA[dir])-0.5*t1*(dataA[dir]*dataA[dir]+a*a)+fun2(t1*dataA[dir]*a);
		}
		
		__syncthreads();
		for(unsigned int s=threadsBlock>>1; s>0; s>>=1) {
			if (id<s){
				reduction[id] = reduction[id] + reduction[id+s];
			}
			__syncthreads();
		}
	
		if(id==0){
			b=-n1*log(t1);
			b-=reduction[0];
			c=0;
		}
		   
		for (int m=0; m<HR; m++){
			reduction[id]=0;
			for(int i=0; i<m2; i++){
				int dir=m*n2+id+i*threadsBlock;
				a = dataB[dir]*dataB[dir]+signalsB[dir]*signalsB[dir];
				reduction[id]+=log(dataB[dir])-0.5*t2[m]*a+fun2(t2[m]*dataB[dir]*signalsB[dir]);
			}
			__syncthreads();
			for(unsigned int s=threadsBlock>>1; s>0; s>>=1) {
				if (id<s){
					reduction[id] = reduction[id] + reduction[id+s];
				}
				__syncthreads();
			}
			if(id==0){
				c+=n2*log(t2[m]);
				c+=reduction[0];
			}
		}
		if(id==0){
			*result=b-c; 
		}
	}
}
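
For comparison, here is a minimal, self-contained sketch of the shared-memory tree reduction pattern that the function above repeats several times. It is not the original code: `blockSum`, `sumKernel`, `runSum` and the `THREADS` constant are hypothetical names, and the block size is assumed to be a power of two (64, as in the post). The one deliberate difference is a barrier after `scratch[0]` is read, so that no thread can overwrite the scratch array for the next round while another thread is still reading the previous total.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS 64  // assumed block size (power of two), matching the 64 threads in the post

// Block-wide sum of one value per thread. Every thread of the block must call it.
__device__ double blockSum(double *scratch, double value) {
    const unsigned int id = threadIdx.x;
    scratch[id] = value;
    __syncthreads();                      // all partial values written before any is read
    for (unsigned int s = blockDim.x >> 1; s > 0; s >>= 1) {
        if (id < s) {
            scratch[id] += scratch[id + s];
        }
        __syncthreads();                  // outside the if, so every thread reaches it
    }
    const double total = scratch[0];
    __syncthreads();                      // everyone has read scratch[0] before it is reused
    return total;
}

// Calls the reduction repeatedly, the way the post calls it inside loops.
__global__ void sumKernel(const double *in, double *out, int rounds) {
    __shared__ double scratch[THREADS];
    double acc = 0.0;
    for (int r = 0; r < rounds; ++r) {
        acc += blockSum(scratch, in[r * blockDim.x + threadIdx.x]);
    }
    if (threadIdx.x == 0) {
        *out = acc;
    }
}

// Host helper (hypothetical): fills the input with ones, returns the result or -1.0 on error.
double runSum(int rounds) {
    const size_t n = static_cast<size_t>(rounds) * THREADS;
    std::vector<double> hIn(n, 1.0);
    double *dIn = nullptr, *dOut = nullptr;
    double hOut = -1.0;
    if (cudaMalloc(&dIn, n * sizeof(double)) == cudaSuccess &&
        cudaMalloc(&dOut, sizeof(double)) == cudaSuccess &&
        cudaMemcpy(dIn, hIn.data(), n * sizeof(double), cudaMemcpyHostToDevice) == cudaSuccess) {
        sumKernel<<<1, THREADS>>>(dIn, dOut, rounds);
        if (cudaMemcpy(&hOut, dOut, sizeof(double), cudaMemcpyDeviceToHost) != cudaSuccess) {
            hOut = -1.0;
        }
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}
```

With an input of all ones, each round contributes 64 to the total, so calling `runSum(3)` from a host `main` should give 192 if every barrier behaves as expected.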

moises1982 · May 4, 2013, 6:07pm

When I run it, sometimes all threads (64/64) reach all the barriers, but other times they do not. As you can see, I added a check in the middle of the code that counts the number of threads that reach a barrier. Each thread prints its id and this count. This is the output in some cases:

CHECK: 32...63.00000000000000000000
CHECK: 33...63.00000000000000000000
CHECK: 34...63.00000000000000000000
CHECK: 35...63.00000000000000000000
CHECK: 36...63.00000000000000000000
CHECK: 37...63.00000000000000000000
CHECK: 38...63.00000000000000000000
CHECK: 39...63.00000000000000000000
CHECK: 40...63.00000000000000000000
CHECK: 41...63.00000000000000000000
CHECK: 42...63.00000000000000000000
CHECK: 43...63.00000000000000000000
CHECK: 44...63.00000000000000000000
CHECK: 45...63.00000000000000000000
CHECK: 46...63.00000000000000000000
CHECK: 47...63.00000000000000000000
CHECK: 48...63.00000000000000000000
CHECK: 49...63.00000000000000000000
CHECK: 50...63.00000000000000000000
CHECK: 51...63.00000000000000000000
CHECK: 52...63.00000000000000000000
CHECK: 53...63.00000000000000000000
CHECK: 54...63.00000000000000000000
CHECK: 55...63.00000000000000000000
CHECK: 56...63.00000000000000000000
CHECK: 57...63.00000000000000000000
CHECK: 58...63.00000000000000000000
CHECK: 59...63.00000000000000000000
CHECK: 60...63.00000000000000000000
CHECK: 61...63.00000000000000000000
CHECK: 62...63.00000000000000000000
CHECK: 63...63.00000000000000000000
CHECK: 1...63.00000000000000000000
CHECK: 2...63.00000000000000000000
CHECK: 3...63.00000000000000000000
CHECK: 4...63.00000000000000000000
CHECK: 5...63.00000000000000000000
CHECK: 6...63.00000000000000000000
CHECK: 7...63.00000000000000000000
CHECK: 8...63.00000000000000000000
CHECK: 9...63.00000000000000000000
CHECK: 10...63.00000000000000000000
CHECK: 11...63.00000000000000000000
CHECK: 12...63.00000000000000000000
CHECK: 13...63.00000000000000000000
CHECK: 14...63.00000000000000000000
CHECK: 15...63.00000000000000000000
CHECK: 16...63.00000000000000000000
CHECK: 17...63.00000000000000000000
CHECK: 18...63.00000000000000000000
CHECK: 19...63.00000000000000000000
CHECK: 20...63.00000000000000000000
CHECK: 21...63.00000000000000000000
CHECK: 22...63.00000000000000000000
CHECK: 23...63.00000000000000000000
CHECK: 24...63.00000000000000000000
CHECK: 25...63.00000000000000000000
CHECK: 26...63.00000000000000000000
CHECK: 27...63.00000000000000000000
CHECK: 28...63.00000000000000000000
CHECK: 29...63.00000000000000000000
CHECK: 30...63.00000000000000000000
CHECK: 31...63.00000000000000000000

So, while thread 0 is computing a log operation, the other threads don’t wait for it. But at least the threads of the same warp should wait for it.

I don’t understand what is happening.
Please, any idea or explanation?

Thank you
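
The arrival check described above can also be isolated into a small standalone test. The following is only a sketch under some assumptions: the counter is a dedicated `__shared__ int` (not aliased to `result`), a single thread resets it, and the names `countArrivals` and `allThreadsCounted` are hypothetical. It also records the value returned by `__syncthreads_count(1)`, which on devices of compute capability 2.0 or higher reports how many threads of the block passed a non-zero predicate at the barrier.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread stores the arrival count it observed after the barriers.
__global__ void countArrivals(int *seen, int *viaBuiltin) {
    __shared__ int counter;               // hypothetical dedicated counter
    if (threadIdx.x == 0) {
        counter = 0;                      // a single thread resets it
    }
    __syncthreads();                      // reset is complete before any increment
    atomicAdd(&counter, 1);               // integer atomic on shared memory
    __syncthreads();                      // all increments are complete before any read
    seen[threadIdx.x] = counter;
    // Barrier that also counts the threads whose predicate is non-zero.
    viaBuiltin[threadIdx.x] = __syncthreads_count(1);
}

// Host helper (hypothetical): true if every thread observed `threads` arrivals both ways.
bool allThreadsCounted(int threads) {
    const size_t bytes = static_cast<size_t>(threads) * sizeof(int);
    std::vector<int> seen(threads, -1), builtin(threads, -1);
    int *dSeen = nullptr, *dBuiltin = nullptr;
    bool ok = cudaMalloc(&dSeen, bytes) == cudaSuccess &&
              cudaMalloc(&dBuiltin, bytes) == cudaSuccess;
    if (ok) {
        countArrivals<<<1, threads>>>(dSeen, dBuiltin);
        ok = cudaMemcpy(seen.data(), dSeen, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(builtin.data(), dBuiltin, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dSeen);
    cudaFree(dBuiltin);
    for (int i = 0; ok && i < threads; ++i) {
        ok = (seen[i] == threads) && (builtin[i] == threads);
    }
    return ok;
}
```

If `allThreadsCounted(64)` holds in isolation but the count inside the full kernel is still 63, that would point towards the surrounding code rather than the barrier itself.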

moises1982 · May 4, 2013, 6:12pm

I have also checked that the con boolean is always false for all the threads (so all of them are executing the first branch), and that all threads reach the barriers before the checkpoint. So the problem happens at this point.
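
One way to confirm on the device that a flag such as con really has the same value in every thread of the block is to combine `__syncthreads_and` and `__syncthreads_or` (both available from compute capability 2.0). The sketch below is an illustration only: `checkUniform` and `isBlockUniform` are hypothetical names, and the flags are read from a test array rather than from the real kernel arguments.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Writes 1 to *uniform if every thread in the block holds the same truth value, else 0.
__global__ void checkUniform(const int *flags, int *uniform) {
    const int con = flags[threadIdx.x];
    const int all = __syncthreads_and(con);   // non-zero only if con is non-zero in all threads
    const int any = __syncthreads_or(con);    // non-zero if con is non-zero in any thread
    if (threadIdx.x == 0) {
        // Uniform means "all true" or "none true".
        *uniform = ((all != 0) == (any != 0)) ? 1 : 0;
    }
}

// Host helper (hypothetical): returns 1 if uniform, 0 if not, -1 on a CUDA error.
int isBlockUniform(const std::vector<int> &flags) {
    const int threads = static_cast<int>(flags.size());
    const size_t bytes = flags.size() * sizeof(int);
    int *dFlags = nullptr, *dUniform = nullptr;
    int result = -1;
    if (cudaMalloc(&dFlags, bytes) == cudaSuccess &&
        cudaMalloc(&dUniform, sizeof(int)) == cudaSuccess &&
        cudaMemcpy(dFlags, flags.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        checkUniform<<<1, threads>>>(dFlags, dUniform);
        if (cudaMemcpy(&result, dUniform, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) {
            result = -1;
        }
    }
    cudaFree(dFlags);
    cudaFree(dUniform);
    return result;
}
```

A block of 64 flags that are all zero should be reported as uniform, while setting a single flag to 1 should not.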
