I am tracking down a divergence between the GPU and CPU versions of (a first and naive implementation of) an optimisation algorithm, and I am having an issue with __syncthreads().
d_Population is a matrix in global device memory used to store NBPOP points of dimension NBDIM. The “MutationCrossoverReplacement” kernel below calculates new candidate points and updates the matrix where needed (all threads write to this global device variable).
The kernel is meant to be called in a loop in the host code, after everything has been initialized. There is already a divergence after a single iteration, though.
It’s most probably a synchronisation issue: some threads are already writing to the matrix while others are still reading the already-updated values, etc. When I split the kernel in two (leaving only the replacement part in a second kernel, and using an intermediate d_NewPopulation matrix to store the new points), the problem disappears.
What surprises me is that it also happens when I run the kernel on just 1 block x 16 threads. Shouldn’t __syncthreads() eliminate the problem?
Can someone please explain what is going on?
__device__ __constant__ float d_LowBound[NBDIM];
__device__ __constant__ float d_HighBound[NBDIM];
__device__ __constant__ float d_NeighbourhoodSize[NBDIM];
__device__ __constant__ float d_F,d_CR;
__device__ float d_Fitnesses[NBPOP];
__device__ float d_NewFitnesses[NBPOP];
__device__ float d_Population[NBPOP*NBDIM];
__device__ float d_NewPopulation[NBPOP*NBDIM];
template<class F>
__global__ void MutationCrossoverReplacement(CUDARand* d_Rnd, int count)
{
    int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
    float l_NewPoint[NBDIM];

    // select samples
    int r1, r2, r3;
    SelectSamples(&d_Rnd[tid], tid, &r1, &r2, &r3); // draws 3 distinct random ints in [0,NBPOP[, all different from tid

    // mutation
    for (int j = 0; j < NBDIM; j++)
    {
        float newcoord = d_Population[r1*NBDIM+j] + (d_Population[r2*NBDIM+j] - d_Population[r3*NBDIM+j]) * d_F;
        l_NewPoint[j] = min(max(newcoord, d_LowBound[j]), d_HighBound[j]);
    }

    // crossover
    int l = (int)(d_Rnd[tid].uniforme() * NBDIM); // draws a random int in [0,NBDIM[
    for (int j = 0; j < NBDIM; j++)
    {
        if (d_Rnd[tid].uniforme() > d_CR && j != l)
            l_NewPoint[j] = d_Population[tid*NBDIM+j];
    }

    // evaluation
    float newfitness = F()(&l_NewPoint[0], NBDIM); // evaluation of a cost function

    __syncthreads();

    // replacement
    if (newfitness < d_Fitnesses[tid])
    {
        for (int j = 0; j < NBDIM; j++)
            d_Population[tid*NBDIM+j] = l_NewPoint[j];
        d_Fitnesses[tid] = newfitness;
    }
    __syncthreads();
}
__syncthreads() only synchronizes the threads within a block, not across blocks. To synchronize across blocks, multiple kernel invocations (as you did) are the only option.
This is an inherent limitation of CUDA. Currently there is no way of ensuring data consistency across blocks when accessing global memory other than letting the current kernel execution terminate.
Refer to the book “Programming Massively Parallel Processors”, chapter 5.
Thanks. I don’t have access to the book at the moment. Are you saying there is “no way of ensuring data consistency across threads when accessing global memory” as opposed to when accessing shared memory?
I have seen this statement about shared vs. global memory access with __syncthreads() on another forum as well, so I tried to make use of shared memory, replacing the above kernel with:
template<class F>
__global__ void MutationCrossoverReplacementShared(CUDARand* d_Rnd, int count)
{
    int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

    // load data into shared memory
    __shared__ float s_Population[NBPOPPERISLAND*NBDIM];
    for (int i = 0; i < NBDIM; i++)
        s_Population[threadIdx.x*NBDIM+i] = d_Population[(blockIdx.x*NBPOPPERISLAND+threadIdx.x)*NBDIM+i];
    __shared__ float s_Fitnesses[NBPOPPERISLAND];
    s_Fitnesses[threadIdx.x] = d_Fitnesses[blockIdx.x*NBPOPPERISLAND+threadIdx.x];
    __syncthreads();

    float l_NewPoint[NBDIM];

    // select samples
    int r1, r2, r3;
    SelectSamples(&d_Rnd[tid], tid, &r1, &r2, &r3);

    // mutation
    for (int j = 0; j < NBDIM; j++)
    {
        float newcoord = s_Population[r1*NBDIM+j] + (s_Population[r2*NBDIM+j] - s_Population[r3*NBDIM+j]) * d_F;
        l_NewPoint[j] = min(max(newcoord, d_LowBound[j]), d_HighBound[j]);
    }

    // crossover
    int l = (int)(d_Rnd[tid].uniforme() * NBDIM);
    for (int j = 0; j < NBDIM; j++)
    {
        if (d_Rnd[tid].uniforme() > d_CR && j != l)
            l_NewPoint[j] = s_Population[tid*NBDIM+j];
    }

    // evaluation
    float newfitness = F()(&l_NewPoint[0], NBDIM);

    __syncthreads();

    // replacement
    if (newfitness < s_Fitnesses[tid])
    {
        for (int j = 0; j < NBDIM; j++)
            s_Population[tid*NBDIM+j] = l_NewPoint[j];
        s_Fitnesses[tid] = newfitness;
    }
    __syncthreads();

    // write data back to global memory
    for (int i = 0; i < NBDIM; i++)
        d_Population[(blockIdx.x*NBPOPPERISLAND+threadIdx.x)*NBDIM+i] = s_Population[threadIdx.x*NBDIM+i];
    d_Fitnesses[blockIdx.x*NBPOPPERISLAND+threadIdx.x] = s_Fitnesses[threadIdx.x];
}
Unfortunately, the result is exactly the same…
The problem with calling two kernels is that I won’t be able to make sensible use of shared memory.
Ultimately, I would like to use shared memory and loop inside the kernel instead of running a loop of kernel launches, synchronizing the blocks through global memory only once every N iterations.
It looks like the example in the CUDA C Programming Guide, section 5.4.3, is very similar to this.
Would this be a case for using the “volatile” keyword? I will give it a try tonight.