Wrong result after a certain number of iterations: execution doesn't give the same results as the CPU

Hi all,

I’ve been racking my brain over this problem, but I still haven’t figured out a way to solve it.

I have two kernels that feed results back to each other.

In the first kernel, BP1, the tmp values are updated (tmp_i = f(msg_i)). Then, in the second kernel, the messages msg are computed (msg_i = f(tmp_i)).

I had to split the implementation into two kernels, because the two operations put together in the same kernel didn’t give the expected results, whereas separate kernels did (even though I had a __syncthreads() call between the two phases).
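As far as I understand, __syncthreads() only synchronizes the threads of one block, so it cannot order reads and writes between different blocks; the kernel launch boundary is the only grid-wide barrier. Here is a minimal sketch (with made-up names) of the kind of race I suspect in the fused version:

	// Minimal sketch (made-up names). __syncthreads() is a barrier for
	// ONE block only, so the fused version can race between blocks.
	__global__ void fused_bp(float *msg, float *tmp, int n)
	{
	    int i = blockIdx.x * blockDim.x + threadIdx.x;
	    if (i < n)
	        tmp[i] = 0.5f * msg[i];   // phase 1: tmp_i = f(msg_i)
	    __syncthreads();              // synchronizes THIS block only!
	    // Phase 2 reads a neighbour's tmp. If i+1 belongs to another
	    // block, that block may not have executed phase 1 yet -> race.
	    if (i + 1 < n)
	        msg[i] = tmp[i + 1];
	}

Splitting the two phases into separate launches makes the kernel boundary act as an implicit grid-wide barrier.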

Here nb_iter is the number of iterations.

The problem:

Whenever nb_iter goes over 12 iterations, I don’t get the expected results (I compare with a CPU function).

Has anybody already run into this kind of problem?

If nb_iter==11, I have the same results as the CPU…

I don’t think this is a precision problem, since I only work with integers and floats.

for (int it = 0; it < nb_iter; it++) {
    printf("\n iteration n %d", it);

    Kernel_BP1<<<block2, thread2>>>     // Update tmp
        (d_cost_data_BP,
         msg_l, msg_r, msg_u, msg_d,
         tmp_l, tmp_r, tmp_u, tmp_d,
         nbcol, nbrow, nblabel,
         Disp_Param);
    cutilCheckMsg("Kernel execution failed");
    cudaThreadSynchronize();

    Kernel_BP2<<<block2, thread2>>>     // Compute msg
        (tmp_l, tmp_r, tmp_u, tmp_d,
         msg_l, msg_r, msg_u, msg_d,
         nbcol, nbrow,
         S0, S1,
         nblabel);
    cutilCheckMsg("Kernel execution failed");
    cudaThreadSynchronize();
}

It is pretty hard to say anything without at least some indication of what the kernels actually do, how they use the variables they “share” in global memory, and how you are determining that the results are “wrong”.

I have an explicit convection-diffusion solver which repeats three kernels (ghost-cell calculation, spatial discretization, and Runge-Kutta stage update) sequentially, in conjunction with some device-to-device memcpy() calls, literally millions of times in a loop, keeping the working data set on the device in global memory, without incident. So I very much doubt it is the high-level structure of the code that is causing your problems.

I can post the code of the kernels, but I was really surprised that it would go wrong only after the 12th iteration and not before.

I know the results are wrong because I am doing a pixel-by-pixel comparison with a CPU version (the final result is contained in d_disp, the disparity map). I was wondering if this was a common mistake with loops that somebody had already run into.

The code that I’m trying to implement is a belief-propagation computation.

Here are the kernels that I use, and in which order:

for (int it = 0; it < nb_iter; it++) {
    printf("\n iteration n %d", it);

    Kernel_BP1<<<block2, thread2>>>     // Update tmp
        (d_cost_data_BP,
         msg_l, msg_r, msg_u, msg_d,
         tmp_l, tmp_r, tmp_u, tmp_d,
         nbcol, nbrow, nblabel,
         Disp_Param);
    cutilCheckMsg("Kernel execution failed");
    cudaThreadSynchronize();

    Kernel_BP2<<<block2, thread2>>>     // Compute msg
        (tmp_l, tmp_r, tmp_u, tmp_d,
         msg_l, msg_r, msg_u, msg_d,
         nbcol, nbrow,
         S0, S1,
         nblabel);
    cutilCheckMsg("Kernel execution failed");
    cudaThreadSynchronize();
}

Kernel_BP3<<<block2, thread2>>>         // Compute belief vectors
    (d_cost_data_BP,
     d_disp,
     msg_l, msg_u, msg_r, msg_d,
     nbcol, nbrow, nblabel,
     offset);
cutilCheckMsg("Kernel execution failed");

cutilSafeCall(cudaMemcpy(disp_map, d_disp, nbcol*nbrow*sizeof(float), cudaMemcpyDeviceToHost));

And here are the kernels themselves

extern "C"

__global__ void 

Kernel_BP1

		(int * d_cost_data_BP,		 

		 float * d_msg_l,

		 float * d_msg_r,

		 float * d_msg_u,

		 float * d_msg_d,

		 float * d_tmp_l,

		 float * d_tmp_r,

		 float * d_tmp_u,

		 float * d_tmp_d,

		 int nbcol,int nbrow,int nblabel,

		 DISPARITY_PARAMETERS Disp_Param)

{

	int		j,lab,it,jj, nbcol2;

	int		ind,ind0,indd,indu,indn;

	

	const int idx=threadIdx.x+ blockDim.x*blockIdx.x;

	const int idy=threadIdx.y+ blockDim.y*blockIdx.y;

	nbcol2=nbcol+2;

	// Iterative Belief Propagation

	if ((idx>=1) && (idy >=1) && (idx<=nbcol) && (idy <=nbrow)){

			

		if ((idx==623) && (idy==56)){

			int a=3+8;

			int bb=a+8;

		}

		j=idy*nbcol2+idx;

		ind0=j*nblabel;		// p(x,y) lab=0

		jj=nbcol2*nblabel;	// # of pixels per row // Use to comp		

		indu=ind0-jj;		// p(x,y-1) lab=0

		indd=ind0+jj;		// p(x,y+1) lab=0	

		for(lab=0;lab<nblabel;lab++) // Loop on label

		{

			ind=ind0+lab; // p(x,y) lab=lab

			//*/************************** I **************************************/

			//*/ Création des messages reçus  et recherche du msg minimum**********/

			// Left message

			indn=ind-nblabel;

			d_tmp_l[ind]=  (d_cost_data_BP[indn])+ d_msg_l[indn]+d_msg_u[indn]+d_msg_d[indn];

			// Right message

			indn=ind+nblabel;

			d_tmp_r[ind]=(d_cost_data_BP[indn])+ d_msg_r[indn]+d_msg_u[indn]+d_msg_d[indn];

			// Up message

			indn=indu+lab;

			d_tmp_u[ind]= (d_cost_data_BP[indn])+ d_msg_u[indn]+d_msg_l[indn]+d_msg_r[indn];

			// Down message

			indn=indd+lab;

			d_tmp_d[ind]= (d_cost_data_BP[indn])+ d_msg_d[indn]+d_msg_l[indn]+d_msg_r[indn];

		

		}	

		

	} // Loop on pixels

	

} // <<<<<  End of Kernel

extern "C"

__global__ void 

Kernel_BP2(

			float *d_tmp_l,

			float *d_tmp_r,

			float *d_tmp_u,

			float *d_tmp_d,

			float *d_msg_l,

			float *d_msg_r,

			float *d_msg_u,

			float *d_msg_d,

			int nbcol, int nbrow,

			int S0, int S1,

			int nblabel)

{

	const int idx=threadIdx.x+ blockDim.x*blockIdx.x;

	const int idy=threadIdx.y+ blockDim.y*blockIdx.y;

	float msg_minl=FLT_MAX, msg_minr=FLT_MAX, msg_minu=FLT_MAX, msg_mind=FLT_MAX;

	int lab, ind, ind0;

	if ((idx >=1) && (idy >=1) && (idx<=nbcol) && (idy <=nbrow)){

		if ((idx==623) && (idy==56)){

			int a=3+8;

			int bb=a+8;

		}

		ind0=(idy*(nbcol+2)+idx)*nblabel;

		compute_min(&msg_minl,&msg_minr, &msg_minu, &msg_mind,

			d_tmp_l, d_tmp_r, d_tmp_u, d_tmp_d,

			ind0,  nbcol, nbrow,  nblabel);

		for(lab=0;lab<nblabel;lab++) // Loop on label

		{

			ind=ind0+lab; // Indices global en 3D

			

			d_msg_l[ind]=MMIN(d_tmp_l[ind]+S0,msg_minl+S1);

			d_msg_r[ind]=MMIN(d_tmp_r[ind]+S0,msg_minr+S1);

			d_msg_u[ind]=MMIN(d_tmp_u[ind]+S0,msg_minu+S1);

			d_msg_d[ind]=MMIN(d_tmp_d[ind]+S0,msg_mind+S1);

		}

	}// Loop on pixel

}
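(compute_min is a small device helper that is not reproduced in this post. A hypothetical reconstruction, inferred only from its call site and the FLT_MAX initialization in the caller, would be something like:)

__device__ void compute_min(float *minl, float *minr, float *minu, float *mind,
                            float *d_tmp_l, float *d_tmp_r, float *d_tmp_u, float *d_tmp_d,
                            int ind0, int nbcol, int nbrow, int nblabel)
{
    // Scan the nblabel entries of this pixel in each direction and
    // keep the minimum. (nbcol and nbrow are unused in this sketch.)
    for (int lab = 0; lab < nblabel; lab++) {
        int ind = ind0 + lab;
        if (d_tmp_l[ind] < *minl) *minl = d_tmp_l[ind];
        if (d_tmp_r[ind] < *minr) *minr = d_tmp_r[ind];
        if (d_tmp_u[ind] < *minu) *minu = d_tmp_u[ind];
        if (d_tmp_d[ind] < *mind) *mind = d_tmp_d[ind];
    }
}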

extern "C"

__global__ void 

Kernel_BP3

		(int * d_cost_data_BP,

		 float * d_res,

		 float * d_msg_l,

		 float * d_msg_u,

		 float * d_msg_r,

		 float * d_msg_d,

		 int nbcol,

		 int nbrow,

		 int nblabel,

		 int offset)

{

	float min_cost,cost;

	int lab, label, ind,j,ind0;

	const int idx= threadIdx.x+blockDim.x*blockIdx.x;

	const int idy= threadIdx.y+blockDim.y*blockIdx.y;

	int nbcol2=nbcol+2;

	int nbrow2=nbrow+2;

	if ((idx>=1) && (idy >=1) && (idx<nbcol2-1) && (idy <nbrow2-1))

	{

		j=idy*nbcol2+idx;

		ind0=j*nblabel;	

		

		min_cost=FLT_MAX;

		// Belief nblabel-vector computation and identification of the minimal value

		// Estimation of the disparity map

		// For each pixel : computation of the belief vector

		//					selection of the disparity value that provides the minimal belief cost

		for(lab=0;lab<nblabel;lab++)	// Loop on labels for the current pixel q

		{

			ind=ind0+lab;

			cost= (float ) (d_cost_data_BP[ind] +d_msg_l[ind] +d_msg_r[ind] +d_msg_u[ind] +d_msg_d[ind]);

			//if(lab==0) cost0=cost;

			if(cost<=min_cost)

			{

				min_cost=cost;

				label=lab;

			}		

		} // Loop on labels

		d_res[(idy-1)*nbcol+idx-1]=float(offset+label);

	} // Loop on pixels

}

Do you get a progressive deterioration in the agreement between the CPU and GPU versions, or do you get perfect agreement (or at least within the expected accuracy range of single-precision floating point) until a certain point, and then it breaks spectacularly?

Oh, and I forgot to say that if I run the code in emulation mode, the problem doesn’t show up… as usual…
The only data shared in global memory are the float arrays.

  • tmp_l, tmp_r, tmp_u, tmp_d
  • msg_l, msg_r, msg_u, msg_d

All of these arrays are of size nbcol2*nbrow2*nblabel, because for each pixel of the map (of size nbcol*nbrow) we need to record the data cost corresponding to each label value. (To avoid reading outside the map, we add an apron of one pixel on each side: nbcol2 = nbcol+2 and nbrow2 = nbrow+2.) nblabel is the number of labels that we are working with.
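To be explicit, here is the indexing scheme on this layout (it is the same computation as in Kernel_BP1):

    // nbcol2 * nbrow2 * nblabel floats per array: nblabel values per
    // pixel, plus a one-pixel apron around the nbcol x nbrow map.
    int nbcol2 = nbcol + 2;
    int nbrow2 = nbrow + 2;
    size_t n_elems = (size_t)nbcol2 * nbrow2 * nblabel;

    // For an interior pixel (idx, idy), 1 <= idx <= nbcol, 1 <= idy <= nbrow:
    int j    = idy * nbcol2 + idx;  // linear pixel index in the padded map
    int ind0 = j * nblabel;         // first label entry of that pixel
    int ind  = ind0 + lab;          // entry for label 'lab'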

It breaks only at the 12th iteration; I don’t get any deterioration before. Of course, for the 13th, 14th (and so on) iterations, the deterioration has an increasing impact.

Here is the comparison function that I use:

bool comparison(float *a, float *b, int nbcol, int nbrow, int pitch){

    if (pitch == NULL){
        pitch = nbcol;
    }
    int i, j, idx;
    float error;
    bool res = true;
    int nb = 0;
    float count = 0;

    for (j = 0; j < nbrow; j++){
        for (i = 0; i < nbcol; i++){
            idx = j*pitch + i;
            res = res && (abs(a[idx] - b[idx]) < 0.001);
            if (abs(a[idx] - b[idx]) >= 0.001){
                error = abs(a[idx] - b[idx]);
                count += error;
                nb++;
                printf("\n%d %d : (%f %f) , error : %2f", i, j, a[idx], b[idx], error);
            }
        }
    }
    switch (res){
        case true:  printf("\t Identical \n");
                    break;
        case false: printf("\n Different \n");
                    printf("Mean error: %f\n", (float)((float)count/(float)nb));
                    printf(" Number of errors %d, i.e. %f %%\n", nb, (float)(nb/(1024*768)));
                    break;
    }
    return res;
}

And here are the results that I get for 12 iterations:

idx   idy    a(idx,idy)  b(idx,idy)  abs(a(idx,idy)-b(idx,idy))

622  55 : (19.000000 20.000000) , error : 1.000000
546 104 : (18.000000 20.000000) , error : 2.000000
192 145 : (23.000000 18.000000) , error : 5.000000
961 163 : (23.000000 14.000000) , error : 9.000000
995 163 : (19.000000 21.000000) , error : 2.000000
543 171 : (20.000000 19.000000) , error : 1.000000
469 204 : (18.000000 19.000000) , error : 1.000000
109 250 : (23.000000 24.000000) , error : 1.000000
108 252 : (23.000000 24.000000) , error : 1.000000
109 252 : (23.000000 24.000000) , error : 1.000000
613 253 : (20.000000 23.000000) , error : 3.000000
145 257 : (23.000000 24.000000) , error : 1.000000
610 258 : (23.000000 26.000000) , error : 3.000000
383 277 : (21.000000 22.000000) , error : 1.000000
389 281 : (22.000000 21.000000) , error : 1.000000
286 288 : (24.000000 18.000000) , error : 6.000000
293 290 : (18.000000 24.000000) , error : 6.000000
 82 308 : (19.000000 20.000000) , error : 1.000000
387 310 : (25.000000 22.000000) , error : 3.000000
651 315 : (23.000000 24.000000) , error : 1.000000
164 325 : (24.000000 23.000000) , error : 1.000000
424 360 : (22.000000 24.000000) , error : 2.000000
671 436 : (23.000000 21.000000) , error : 2.000000
 12 456 : (28.000000 27.000000) , error : 1.000000
 12 459 : (27.000000 28.000000) , error : 1.000000
708 461 : (19.000000 23.000000) , error : 4.000000
370 473 : (36.000000 24.000000) , error : 12.000000
386 504 : (25.000000 24.000000) , error : 1.000000
359 664 : (31.000000 30.000000) , error : 1.000000

 Different arrays
 Mean error: 2.586207
 Number of errors: 29, i.e. 0.000000 % (< 10E-4)

I always get the errors at the same idx and idy values.

I’ve tested in emulation mode to see what happens at those pixels (knowing that in emulation the problem does not appear), and I can’t find any difference from the other values that are computed correctly.

Just a note: you don’t need cudaThreadSynchronize() after each kernel call; kernel launches issued to the same stream execute in order, so each kernel sees the previous one’s writes.

Also, why do you have the kernels declared to be extern “C”?

Yeah, you’re right: I only need the extern "C" for the functions that I call from the main program, not for the kernels (which are called inside a particular function). But I had a lot of trouble linking everything correctly in my first program, so I had put extern "C" before every function, even the kernels. I can get rid of those now.

Concerning the cudaThreadSynchronize() call, this was just an attempt to see whether the error came from a lack of synchronization. I thought that after 12 successive launches from the CPU, the device might not handle thread termination very well. But it didn’t have any impact on the final result.

Your comparison function is broken: abs() is an integer-valued function; you want to use fabs() to compare floats. As it is, you have implicit float-to-integer truncation happening in several places.
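For example, something like this (a minimal sketch, with a made-up helper name) keeps the whole comparison in floating point:

    #include <math.h>
    #include <stdbool.h>

    /* fabsf() returns the float absolute value; abs() from <stdlib.h>
       takes and returns int, so abs(a[idx]-b[idx]) truncates the
       difference to an integer before the tolerance test. */
    static bool nearly_equal(float x, float y, float tol)
    {
        return fabsf(x - y) < tol;
    }

i.e. replace both abs(a[idx]-b[idx]) tests with nearly_equal(a[idx], b[idx], 0.001f).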

I already had problems between types (though between double and float) when executing on the device, in another kernel. I’m implementing a CPU code that I did not write myself. Even though this kind of comparison/truncation works on the CPU, it definitely doesn’t behave the same way on the device.

I’m going to correct the code; I’ll let you know if this changes anything.

Thanks for your hints.

I got rid of the float-to-integer truncation and spent some more time looking more closely at the code. But I still don’t know what’s going on and why it produces errors so abruptly at the 12th iteration…

I tested it with CUDA 2.2 and a GTX 280; after a few adaptations (my colleague was using Visual Express 2008), I finally got the same error again…

If anybody had another hint, I would really appreciate it, thank you !

I narrowed down the problem to this line

cost= d_cost_data_BP[ind] +d_msg_l[ind] +d_msg_r[ind] +d_msg_u[ind] +d_msg_d[ind];

(in Kernel_BP3)

I checked that d_cost_data_BP, d_msg_l, d_msg_r, d_msg_u, and d_msg_d are exactly the same as those produced by the CPU.

The error is occurring somewhere during the sum. If I check

cost = d_cost_data_BP [ind] + d_msg_l[ind] ;

or

cost= d_cost_data_BP [ind] + d_msg_r[ind] ;

and so on…

then I get the exact same result as the CPU (if the CPU function is changed in the same way).

It is only when I have the whole expression that I get mistakes.
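One experiment that might isolate it (just a sketch, I haven’t verified this): accumulate the sum in a fixed order with the __fadd_rn() intrinsic, which performs a plain IEEE round-to-nearest single-precision add and keeps the compiler from fusing or reordering the operations. Also worth keeping in mind: integer values stored in floats stay exact only up to 2^24, so if the messages grow across iterations, the order of these adds could start to matter.

    // Sketch: fixed-order accumulation with __fadd_rn() (IEEE
    // round-to-nearest add, no MAD contraction), to test whether the
    // evaluation order of the sum explains the CPU/GPU difference.
    float cost = (float)d_cost_data_BP[ind];
    cost = __fadd_rn(cost, d_msg_l[ind]);
    cost = __fadd_rn(cost, d_msg_r[ind]);
    cost = __fadd_rn(cost, d_msg_u[ind]);
    cost = __fadd_rn(cost, d_msg_d[ind]);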

I don’t know where it comes from; I don’t see any variable that the threads might be sharing or writing at the same time (which would produce the mistake).

Would anybody know where this all comes from?

Thank You!