__syncthreads() issue

Hello,

I am tracking down a divergence between the GPU and CPU versions of (a first and naive implementation of) an optimisation algorithm, and I am running into an issue with __syncthreads().

d_Population is a matrix in global device memory used to store NBPOP NBDIM-dimensional points. The “MutationCrossoverReplacement” kernel below calculates new potential points and updates the matrix if need be (all threads are writing to this global device variable).

The kernel is meant to be called in a loop in the host code, after everything has been initialized. However, a divergence already appears after a single iteration.

It is most probably a synchronisation issue: some threads are already writing to the matrix while others are still reading, so they pick up already-updated values, etc. When I split the kernel in two (leaving only the replacement part in a second kernel and using an intermediate d_NewPopulation matrix to store the new points), the problem disappears (see the sketch at the end of this post).

What surprises me is that it also happens when I run the kernel on just 1 block x 16 threads. Shouldn't __syncthreads() eliminate the problem?

Can someone please explain what is going on?

__device__ __constant__ float d_LowBound[NBDIM];
__device__ __constant__ float d_HighBound[NBDIM];
__device__ __constant__ float d_NeighbourhoodSize[NBDIM];
__device__ __constant__ float d_F,d_CR;

__device__ float d_Fitnesses[NBPOP];
__device__ float d_NewFitnesses[NBPOP];
__device__ float d_Population[NBPOP*NBDIM];
__device__ float d_NewPopulation[NBPOP*NBDIM];

template<class F>
__global__ void MutationCrossoverReplacement(CUDARand* d_Rnd,int count)
{
	int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
	float l_NewPoint[NBDIM];

	//select sample
	int r1,r2,r3;
	SelectSamples(&d_Rnd[tid],tid,&r1,&r2,&r3);	//draws 3 different random ints in [0,NBPOP[ and different from tid

	//mutation
	for(int j=0;j<NBDIM;j++)
	{
		float newcoord=d_Population[r1*NBDIM+j]+(d_Population[r2*NBDIM+j]-d_Population[r3*NBDIM+j])*d_F;
		l_NewPoint[j]=min(max(newcoord,d_LowBound[j]),d_HighBound[j]);
	}

	//crossover
	int l=(int)(d_Rnd[tid].uniforme()*NBDIM);	//draws a random int in [0,NBDIM[
	for(int j=0;j<NBDIM;j++)
	{
		if(d_Rnd[tid].uniforme()>d_CR && j!=l)
			l_NewPoint[j]=d_Population[tid*NBDIM+j];
	}

	//evaluation
	float newfitness=F()(&l_NewPoint[0],NBDIM);	//evaluation of a cost function

	__syncthreads();

	//replacement
	if(newfitness<d_Fitnesses[tid])
	{
		for(int j=0;j<NBDIM;j++)
			d_Population[tid*NBDIM+j]=l_NewPoint[j];
		d_Fitnesses[tid]=newfitness;
	}

	__syncthreads();
}
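(For reference, SelectSamples does nothing elaborate; it is functionally equivalent to something like the following sketch, although the actual implementation may differ.)

//Functional sketch of SelectSamples: draw three mutually distinct indices in
//[0,NBPOP[ that also differ from tid, using the per-thread generator.
__device__ void SelectSamples(CUDARand* rnd,int tid,int* r1,int* r2,int* r3)
{
	do { *r1=(int)(rnd->uniforme()*NBPOP); } while(*r1==tid);
	do { *r2=(int)(rnd->uniforme()*NBPOP); } while(*r2==tid || *r2==*r1);
	do { *r3=(int)(rnd->uniforme()*NBPOP); } while(*r3==tid || *r3==*r1 || *r3==*r2);
}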

I thought this was caused by thread divergence, so I replaced:

__syncthreads();

//replacement
if(newfitness<d_Fitnesses[tid])
{
	for(int j=0;j<NBDIM;j++)
		d_Population[tid*NBDIM+j]=l_NewPoint[j];
	d_Fitnesses[tid]=newfitness;
}

by:

__syncthreads();

for(int j=0;j<NBDIM;j++)
	d_Population[tid*NBDIM+j]=newfitness<d_Fitnesses[tid]?l_NewPoint[j]:d_Population[tid*NBDIM+j];
d_Fitnesses[tid]=min(newfitness,d_Fitnesses[tid]);

But unfortunately that does not change anything…
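For reference, here is roughly what the working split version looks like (a simplified sketch of what I run; CUDARand, SelectSamples and the fitness functor F are the same as above, and the intermediate arrays are the d_NewPopulation / d_NewFitnesses declared earlier):

template<class F>
__global__ void MutationCrossover(CUDARand* d_Rnd,int count)
{
	int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
	float l_NewPoint[NBDIM];

	//sample selection, mutation and crossover exactly as in the fused kernel,
	//but d_Population is only read here, never written
	int r1,r2,r3;
	SelectSamples(&d_Rnd[tid],tid,&r1,&r2,&r3);
	for(int j=0;j<NBDIM;j++)
	{
		float newcoord=d_Population[r1*NBDIM+j]+(d_Population[r2*NBDIM+j]-d_Population[r3*NBDIM+j])*d_F;
		l_NewPoint[j]=min(max(newcoord,d_LowBound[j]),d_HighBound[j]);
	}
	int l=(int)(d_Rnd[tid].uniforme()*NBDIM);
	for(int j=0;j<NBDIM;j++)
		if(d_Rnd[tid].uniforme()>d_CR && j!=l)
			l_NewPoint[j]=d_Population[tid*NBDIM+j];

	//stage the candidate point and its fitness in the intermediate arrays
	for(int j=0;j<NBDIM;j++)
		d_NewPopulation[tid*NBDIM+j]=l_NewPoint[j];
	d_NewFitnesses[tid]=F()(&l_NewPoint[0],NBDIM);
}

__global__ void Replacement()
{
	int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
	//every read of d_Population in the first kernel has completed before this
	//kernel starts, so overwriting it here is safe
	if(d_NewFitnesses[tid]<d_Fitnesses[tid])
	{
		for(int j=0;j<NBDIM;j++)
			d_Population[tid*NBDIM+j]=d_NewPopulation[tid*NBDIM+j];
		d_Fitnesses[tid]=d_NewFitnesses[tid];
	}
}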

__syncthreads() only synchronizes the threads within a block, not between blocks. To synchronize between blocks, multiple kernel invocations (as you did) are the only option.
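On the host side that means the generation loop has to look roughly like this (just a sketch; NBGENERATIONS, NBBLOCKS, NBTHREADS and CostFunctor are placeholders for your own values):

//Host-side sketch: the boundary between two kernel launches is the only
//grid-wide barrier, so the replacement step needs its own kernel.
for(int gen=0;gen<NBGENERATIONS;gen++)
{
	MutationCrossover<CostFunctor><<<NBBLOCKS,NBTHREADS>>>(d_Rnd,NBPOP);	//writes d_NewPopulation / d_NewFitnesses
	Replacement<<<NBBLOCKS,NBTHREADS>>>();					//then safely overwrites d_Population
}
cudaDeviceSynchronize();	//wait for the last generation before reading results back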

Thanks for your answer, tera. I got that. But I also have the problem when I launch the kernel on 1 block of 16 threads.

Ah, right, I missed that point in your initial post. Just to make sure, does your fitness function have __syncthreads() anywhere in it?

Nope, it looks like this (for example):

class GriewankFunctor
{
public:
	__host__ __device__ float operator()(float* x,int nbdim)
	{
		float sum=0.0f;
		float prod=1.0f;
		for (int i=0;i<nbdim;i++)
		{
			sum+=(x[i])*(x[i]);
			prod*=cos((x[i])/sqrt(float(i+1)));
		}
		return sum/4000.0f-prod+1;
	}
};

This is an inherent limitation in CUDA. Currently there is no way of ensuring data consistency across threads when accessing global memory other than terminating the current kernel execution.

Refer to chapter 5 of the book “Programming Massively Parallel Processors”.

Thanks. I don’t have access to the book at the moment. Are you saying “there is no way of ensuring data consistency across threads when accessing global memory” as opposed to when accessing shared memory?

Yes. The only way is to invoke a 2nd kernel.

I have seen this statement about shared vs. global memory access with __syncthreads() on another forum as well, so I tried to make use of shared memory, replacing the above kernel with:

template<class F>
__global__ void MutationCrossoverReplacementShared(CUDARand* d_Rnd,int count)
{
	int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

	//loading data into shared memory: each block holds one island of NBPOPPERISLAND points
	__shared__ float s_Population[NBPOPPERISLAND*NBDIM];
	for(int i=0;i<NBDIM;i++)
		s_Population[threadIdx.x*NBDIM+i]=d_Population[(blockIdx.x*NBPOPPERISLAND+threadIdx.x)*NBDIM+i];
	__shared__ float s_Fitnesses[NBPOPPERISLAND];
	s_Fitnesses[threadIdx.x]=d_Fitnesses[blockIdx.x*NBPOPPERISLAND+threadIdx.x];
	__syncthreads();

	float l_NewPoint[NBDIM];

	//select sample (r1,r2,r3 must stay within [0,NBPOPPERISLAND[ to index the shared arrays below)
	int r1,r2,r3;
	SelectSamples(&d_Rnd[tid],tid,&r1,&r2,&r3);

	//mutation
	for(int j=0;j<NBDIM;j++)
	{
		float newcoord=s_Population[r1*NBDIM+j]+(s_Population[r2*NBDIM+j]-s_Population[r3*NBDIM+j])*d_F;
		l_NewPoint[j]=min(max(newcoord,d_LowBound[j]),d_HighBound[j]);
	}

	//crossover
	int l=(int)(d_Rnd[tid].uniforme()*NBDIM);
	for(int j=0;j<NBDIM;j++)
	{
		if(d_Rnd[tid].uniforme()>d_CR && j!=l)
			l_NewPoint[j]=s_Population[threadIdx.x*NBDIM+j];
	}

	//evaluation
	float newfitness=F()(&l_NewPoint[0],NBDIM);

	__syncthreads();

	//replacement (shared arrays are indexed with threadIdx.x, not the global tid)
	if(newfitness<s_Fitnesses[threadIdx.x])
	{
		for(int j=0;j<NBDIM;j++)
			s_Population[threadIdx.x*NBDIM+j]=l_NewPoint[j];
		s_Fitnesses[threadIdx.x]=newfitness;
	}
	__syncthreads();

	//setting data back to global memory
	for(int i=0;i<NBDIM;i++)
		d_Population[(blockIdx.x*NBPOPPERISLAND+threadIdx.x)*NBDIM+i]=s_Population[threadIdx.x*NBDIM+i];
	d_Fitnesses[blockIdx.x*NBPOPPERISLAND+threadIdx.x]=s_Fitnesses[threadIdx.x];
}

Unfortunately, the result is exactly the same…

The problem with calling two kernels is that I won’t be able to make sensible use of shared memory.

Ultimately, I would want to use shared memory and loop inside the kernel instead of doing a loop of kernels, synchronizing the blocks with global memory only once every N iterations.
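Something like the following is what I have in mind (just a sketch; EvolveIsland and SelectSamplesInIsland are hypothetical names, the latter being an island-local variant of SelectSamples that draws its indices in [0,NBPOPPERISLAND[):

template<class F>
__global__ void EvolveIsland(CUDARand* d_Rnd,int nbiter)
{
	int tid = blockIdx.x*NBPOPPERISLAND + threadIdx.x;

	//each block ("island") keeps its sub-population in shared memory
	__shared__ float s_Population[NBPOPPERISLAND*NBDIM];
	__shared__ float s_Fitnesses[NBPOPPERISLAND];
	for(int i=0;i<NBDIM;i++)
		s_Population[threadIdx.x*NBDIM+i]=d_Population[tid*NBDIM+i];
	s_Fitnesses[threadIdx.x]=d_Fitnesses[tid];
	__syncthreads();

	float l_NewPoint[NBDIM];
	for(int gen=0;gen<nbiter;gen++)
	{
		//mutation and crossover as before, but sampling r1,r2,r3 inside the island
		int r1,r2,r3;
		SelectSamplesInIsland(&d_Rnd[tid],threadIdx.x,&r1,&r2,&r3);	//hypothetical island-local variant
		int l=(int)(d_Rnd[tid].uniforme()*NBDIM);
		for(int j=0;j<NBDIM;j++)
		{
			float newcoord=s_Population[r1*NBDIM+j]+(s_Population[r2*NBDIM+j]-s_Population[r3*NBDIM+j])*d_F;
			newcoord=min(max(newcoord,d_LowBound[j]),d_HighBound[j]);
			l_NewPoint[j]=(d_Rnd[tid].uniforme()>d_CR && j!=l)?s_Population[threadIdx.x*NBDIM+j]:newcoord;
		}
		float newfitness=F()(&l_NewPoint[0],NBDIM);

		//all reads of the current generation must finish before anyone overwrites it
		__syncthreads();
		if(newfitness<s_Fitnesses[threadIdx.x])
		{
			for(int j=0;j<NBDIM;j++)
				s_Population[threadIdx.x*NBDIM+j]=l_NewPoint[j];
			s_Fitnesses[threadIdx.x]=newfitness;
		}
		//and the new generation must be complete before the next round of reads
		__syncthreads();
	}

	//islands only see each other's results after the kernel has finished
	for(int i=0;i<NBDIM;i++)
		d_Population[tid*NBDIM+i]=s_Population[threadIdx.x*NBDIM+i];
	d_Fitnesses[tid]=s_Fitnesses[threadIdx.x];
}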

Looks like the example in the CUDA C Programming Guide, section 5.4.3, is very similar to this.
Would this be a case for using the “volatile” keyword? I will give it a try tonight.
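If I understand it correctly, that would just mean qualifying the shared arrays in the kernel above, e.g.:

//possible change to try (sketch): declare the shared arrays volatile so the
//compiler cannot keep stale copies of their elements in registers
volatile __shared__ float s_Population[NBPOPPERISLAND*NBDIM];
volatile __shared__ float s_Fitnesses[NBPOPPERISLAND];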