Synchronization issues: what is the best way around this problem?

Hello everyone,

I am having some synchronization issues which you may be able to help solve. I am trying to find a way to guarantee synchronization while keeping programming time, potential for errors, etc. to a minimum.

First of all, the code I have written already breaks the kernel up into parts, which provides synchronization at each timestep:

for ( n=(initFrom+1); n<=(initFrom+Nmax); n++ ) {

    printf("n=%i\n",n);

    /*************CUDA Start***************/
    cudaFDTD<<<dimGrid,dimBlock>>>(cudaStruct_ptr,n,0,0);
    ErrorCheckDevice("When Calling the kernel Function (part 0)");

    cudaFDTD<<<dimGrid,dimBlock>>>(cudaStruct_ptr,n,1,0);
    ErrorCheckDevice("When Calling the kernel Function (part 1)");

    if (gearRadius > 0) {
      for (minorStep = 0; minorStep < gearRatio; minorStep++) {
        cudaFDTD<<<dimGrid,dimBlock>>>(cudaStruct_ptr,n,2,minorStep);
        ErrorCheckDevice("When Calling the kernel Function (part 2)");

        cudaFDTD<<<dimGrid,dimBlock>>>(cudaStruct_ptr,n,3,minorStep);
        ErrorCheckDevice("When Calling the kernel Function (part 3)");
      }
    }

    cudaFDTD<<<dimGrid,dimBlock>>>(cudaStruct_ptr,n,4,0);
    ErrorCheckDevice("When Calling the kernel Function (part 4)");

    cudaThreadSynchronize();

...

The issue involves a device function called within cudaFDTD(…), where:

update_pnplus1Old(param,index);

#ifndef __DEVICE_EMULATION__
  __syncthreads();
#endif

  //#ifdef _OPENMP
  //#pragma omp parallel for default(shared) private(k,j,i,r,index,opk,pnjplus2,pnjplus1,z1,z2,z3,z4,z5,z6,z7,z8,z9,z10,z11,z12,z13,z14,lastSeg,c,q1,q2,q3,q4,q5,q6r,q6th,q6z,q7)
  //#endif
  //for ( k=Kfirst; k<=Klast; k++ ) {
  //for ( j=Jinner; j<=Jlimit; j++ ) {
  //for ( i=3; i<=Imax-2; i++ ) {
  if( i >= 2 &&              // >I Bound
      i <= param->Imax-3 &&  // <I Bound
      j >= Jinner-1 &&       // >J Bound
      j <= Jlimit-1 ) {      // <J Bound

    /*if (61==i && j==24 && k==11)
      {
        cuPrintf("before all mess : %f\n", param->pnplus1[index]);
      }
    */

    //index=cudaIndex(i,j,k);
    r=(IFP)((j+0.5)*param->dr);

    // Calculate the spatial functions.
    if (j==0) {
      opk=(k+param->Kmax/2)%param->Kmax;
      z1=-param->pn[index+2*jInc]+16*param->pn[index+jInc]-30*param->pn[index]+16*param->pn[cudaIndex(i,j,opk)]-param->pn[cudaIndex(i,j+1,opk)];
      z2=-param->pn[index+2*jInc]+8*param->pn[index+jInc]-8*param->pn[cudaIndex(i,j,opk)]+param->pn[cudaIndex(i,j+1,opk)];
      z8=z2;
      z9=-param->rho[index+2*jInc]+8*param->rho[index+jInc]-8*param->rho[cudaIndex(i,j,opk)]+param->rho[cudaIndex(i,j+1,opk)];
    } else if (j==1) {
      opk=(k+param->Kmax/2)%param->Kmax;
      z1=-param->pn[index+2*jInc]+16*param->pn[index+jInc]-30*param->pn[index]+16*param->pn[index-jInc]-param->pn[cudaIndex(i,j-1,opk)];
      z2=-param->pn[index+2*jInc]+8*param->pn[index+jInc]-8*param->pn[index-jInc]+param->pn[cudaIndex(i,j-1,opk)];
      z8=z2;
      z9=-param->rho[index+2*jInc]+8*param->rho[index+jInc]-8*param->rho[index-jInc]+param->rho[cudaIndex(i,j-1,opk)];
    } else if (j==Jouter-1) {
      pnjplus2=(weight[0]*param->pnplus1Old[index+2*jInc]+weight[1]*param->pn[index+2*jInc]+
                weight[2]*param->pnminus1[index+2*jInc]+weight[3]*param->pnminus2[index+2*jInc])/6;
      pnjplus1=(weight[0]*param->pnplus1Old[index+jInc]+weight[1]*param->pn[index+jInc]+
                weight[2]*param->pnminus1[index+jInc]+weight[3]*param->pnminus2[index+jInc])/6;
      z1=-pnjplus2+16*pnjplus1-30*param->pn[index]+16*param->pn[index-jInc]-param->pn[index-2*jInc];
      z2=-pnjplus2+8*pnjplus1-8*param->pn[index-jInc]+param->pn[index-2*jInc];
      z8=z2;
      z9=-param->rho[index+2*jInc]+8*param->rho[index+jInc]-8*param->rho[index-jInc]+param->rho[index-2*jInc];
    } else if (j==Jouter-2) {
      pnjplus2=(weight[0]*param->pnplus1Old[index+2*jInc]+weight[1]*param->pn[index+2*jInc]+
                weight[2]*param->pnminus1[index+2*jInc]+weight[3]*param->pnminus2[index+2*jInc])/6;
      z1=-pnjplus2+16*param->pn[index+jInc]-30*param->pn[index]+16*param->pn[index-jInc]-param->pn[index-2*jInc];
      z2=-pnjplus2+8*param->pn[index+jInc]-8*param->pn[index-jInc]+param->pn[index-2*jInc];
      z8=z2;
      z9=-param->rho[index+2*jInc]+8*param->rho[index+jInc]-8*param->rho[index-jInc]+param->rho[index-2*jInc];
    } else {
      z1=-param->pn[index+2*jInc]+16*param->pn[index+jInc]-30*param->pn[index]+16*param->pn[index-jInc]-param->pn[index-2*jInc];
      z2=-param->pn[index+2*jInc]+8*param->pn[index+jInc]-8*param->pn[index-jInc]+param->pn[index-2*jInc];
      z8=z2;
      z9=-param->rho[index+2*jInc]+8*param->rho[index+jInc]-8*param->rho[index-jInc]+param->rho[index-2*jInc];
    }

...

You'll notice that I'm updating pnplus1Old at the start, but I can't guarantee it has been fully updated by the time the else if (j==Jouter-1) {…} or else if (j==Jouter-2) {…} branches read from it, since __syncthreads() only synchronizes the threads within a single block.
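For context, the reason splitting the kernel works at all is that, as I understand it, the kernel launch boundary is the only portable grid-wide barrier: launches issued to the same stream are executed in order, so every block of one launch finishes before the next launch begins. A minimal sketch of that pattern (the names updateOld and computeStep are hypothetical, not from my code):

```cuda
// Hypothetical two-kernel split: the launch boundary is the grid-wide barrier.
__global__ void updateOld(float *pnplus1Old, const float *pnplus1, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        pnplus1Old[idx] = pnplus1[idx];   // phase 1: every thread writes its element
}

__global__ void computeStep(float *pnplus1, const float *pnplus1Old, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx > 0 && idx < N - 1)           // phase 2 may safely read ANY element
        pnplus1[idx] = 0.5f * (pnplus1Old[idx - 1] + pnplus1Old[idx + 1]);
}

// Host side: consecutive launches on the same stream are serialized, so
// computeStep never starts before every block of updateOld has finished:
//   updateOld<<<dimGrid,dimBlock>>>(d_old, d_new, N);
//   computeStep<<<dimGrid,dimBlock>>>(d_new, d_old, N);
```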

There are several ways I was thinking of fixing this without tearing the kernel function up into even more parts, but I want to choose the path that sacrifices the least in terms of speed while remaining supported by later generations of CUDA and OpenCL.

    The first alternative is to employ a semaphore for each thread, which would ensure data integrity for pnplus1Old.

    The second alternative is to use a single block of threads to compute the entire matrix, providing adequate speed-up by making full use of shared memory (a highly unlikely choice).

    I’ve also heard of people programming with respect to how thread assignments propagate through a matrix, but this is undesirable since it might break in later versions of CUDA if thread scheduling techniques were to change.
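For what it's worth, the semaphore approach (the first alternative above) would normally be built from atomics and a flag in global memory. A minimal sketch, with hypothetical names, of what publishing and consuming a guarded value might look like; note the caveat in the comments, which is the usual argument against it:

```cuda
// Hypothetical per-element "ready" flags guarding pnplus1Old-style data.
// WARNING: if the grid has more blocks than can be resident on the device
// at once, a consumer can spin on a flag owned by a block the scheduler has
// not launched yet, deadlocking the GPU. This is the main reason grid-wide
// spin-waits inside a single kernel are considered fragile.

__device__ void publish(volatile int *flag, float *dst, float value)
{
    *dst = value;
    __threadfence();   // make the data store visible to the whole grid
    *flag = 1;         // only then raise the flag
}

__device__ float consume(volatile int *flag, const float *src)
{
    while (*flag == 0) // spin until the producer has published
        ;
    __threadfence();   // order the flag read before the data read
    return *src;
}
```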

I was wondering whether there are any other alternatives I should look into. I'm not very experienced with this sort of problem, and I've heard that adding semaphores to the system wouldn't compare favourably with other ways of dealing with it.
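One further alternative I've been considering is to remove the need for the barrier altogether by double-buffering: never letting any thread read an array that any thread in the same launch writes. A sketch of the idea (stencil and pointer names are hypothetical, not my actual FDTD update):

```cuda
// Hypothetical double-buffering: each launch reads only 'src' and writes
// only 'dst', so no thread ever reads a value written during the same
// launch, and no grid-wide barrier is needed inside the kernel.
__global__ void stepKernel(float *dst, const float *src, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < N - 1)
        dst[i] = 0.5f * (src[i - 1] + src[i + 1]);  // stencil reads src only
}

// Host loop: swap the roles of the two device buffers each timestep.
//   for (int n = 0; n < Nmax; n++) {
//       stepKernel<<<dimGrid,dimBlock>>>(d_b, d_a, N);
//       float *tmp = d_a; d_a = d_b; d_b = tmp;
//   }
```

The cost is one extra array of storage per field, but it keeps each timestep to a single launch and should carry over cleanly to OpenCL.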

Thanks to anyone who responds with any piece of valuable information!

Ben