CUDA kernels crash - please help me out

Hi guys, I am writing CFD software in CUDA. At a certain point I need to solve a tridiagonal system of equations, and I coded two different options: either transfer the data from the GPU to the CPU and solve it there with a standard serial method such as the Thomas algorithm, or let the GPU solve it using the cyclic reduction algorithm. After some preliminary debugging on the cyclic reduction side I still have problems with the GPU solution. The kernels crash most of the time: they typically give the correct results for the first iterations, but then, even under what should be steady conditions, errors start to propagate and the whole thing blows up.

I initially thought the problem could be traced back to floating point accuracy, so I also compiled with sm_13 to enable double precision, but that only gave slight improvements. I am using an NVIDIA Tesla C1060 and I compile with the CUDA 2.2 toolkit. Below I put the two relevant kernels of the cyclic reduction algorithm (reduction and substitution). If anyone could help me out and shed some light I would be extremely grateful…
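For reference, the CPU fallback is just the textbook Thomas algorithm; a minimal serial sketch (the array names here are made up for illustration, this is not my actual host code) would look something like this:

#include <stdlib.h>

//Serial Thomas algorithm: l = lower diagonal, d = main diagonal,
//u = upper diagonal, b = source term, x = solution, n = number of unknowns
//(all arrays assumed to have length n)
void thomas_solve(const double *l,const double *d,const double *u,const double *b,double *x,int n){
	double *cp=(double*)malloc(n*sizeof(double));//modified upper diagonal
	double *dp=(double*)malloc(n*sizeof(double));//modified source term

	//forward elimination
	cp[0]=u[0]/d[0];
	dp[0]=b[0]/d[0];
	for(int i=1;i<n;i++){
		double m=d[i]-l[i]*cp[i-1];
		cp[i]=u[i]/m;
		dp[i]=(b[i]-l[i]*dp[i-1])/m;
	}

	//back substitution
	x[n-1]=dp[n-1];
	for(int i=n-2;i>=0;i--)x[i]=dp[i]-cp[i]*x[i+1];

	free(cp);
	free(dp);
}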

__global__ void CR_reduction(double *LOLD,double *DOLD,double *UOLD,double *BOLD,double *LNEW,double *DNEW,double *UNEW,double *BNEW,float *RESPR,int NMAX){

int x = threadIdx.x + blockIdx.x * blockDim.x;

//Remember : NMAX refers to the previous reduction phase
if(x>=NMAX)return;

if(!x){
	//FIRST EQUATION OF REDUCTION PHASE
	LNEW[x]=0.0;//west coefficient is nil
	DNEW[x]=DOLD[x]-UOLD[x]*(LOLD[x+1]/DOLD[x+1]);
	UNEW[x]=-UOLD[x]*(UOLD[x+1]/DOLD[x+1]);
	BNEW[x]=BOLD[x]-UOLD[x]*(BOLD[x+1]/DOLD[x+1]);
	if(RESPR[x]<RESPR[x+1])RESPR[x]=RESPR[x+1];
}
else if((x & (2-1))==0 && x){ //only the even-indexed equations survive into the reduced system
	if((x+1)==(NMAX-1)){ //east neighbour x+1 is the last equation of the old system, so the reduced equation has no east coefficient
		LNEW[x/2]=-LOLD[x]*(LOLD[x-1]/DOLD[x-1]);
		DNEW[x/2]=DOLD[x]-(LOLD[x]*(UOLD[x-1]/DOLD[x-1]))-(UOLD[x]*(LOLD[x+1]/DOLD[x+1]));
		UNEW[x/2]=0.0;
		BNEW[x/2]=BOLD[x]-(LOLD[x]*(BOLD[x-1]/DOLD[x-1]))-(UOLD[x]*(BOLD[x+1]/DOLD[x+1]));
		if(RESPR[x]<RESPR[x+1])RESPR[x/2]=RESPR[x+1];
		else RESPR[x/2]=RESPR[x];
	}
	else{
		LNEW[x/2]=-LOLD[x]*(LOLD[x-1]/DOLD[x-1]);
		DNEW[x/2]=DOLD[x]-(LOLD[x]*(UOLD[x-1]/DOLD[x-1]))-(UOLD[x]*(LOLD[x+1]/DOLD[x+1]));
		UNEW[x/2]=-UOLD[x]*(UOLD[x+1]/DOLD[x+1]);
		BNEW[x/2]=BOLD[x]-(LOLD[x]*(BOLD[x-1]/DOLD[x-1]))-(UOLD[x]*(BOLD[x+1]/DOLD[x+1]));
		if(RESPR[x]<RESPR[x+1])RESPR[x/2]=RESPR[x+1];
		else RESPR[x/2]=RESPR[x];
	}
}

}

__global__ void CR_substitution(double *L,double *D,double *U,double *B,int *J,int NMAX,float *X){
//L->lower diagonal
//D->main diagonal
//U->upper diagonal
//B->source term
//J->index of array
//NMAX->number of unknowns for given backsubstitution phase
//X->solution vector

int x = threadIdx.x + blockIdx.x * blockDim.x;

//Remember : NMAX refers to the previous reduction phase
if(x>=NMAX)return;

if(!x && X[J[x]]==0.0){
	//first node where equation is solved
	X[J[x]]=B[x]/D[x];
}
else if(x && (x!=(NMAX-1)) && X[J[x]]==0.0){
	//interior node
	X[J[x]]=(B[x]/D[x])-(U[x]*(X[J[x+1]]/D[x]))-(L[x]*(X[J[x-1]]/D[x]));
}
else if(((x==(NMAX-1))) && X[J[x]]==0.0){
	//last node of this substitution phase
	X[J[x]]=(B[x]/D[x])-(L[x]*(X[J[x-1]]/D[x]));
}

}

I am not going to solve your crash for you, but you should try running this thing in emulation mode under a debugging tool called Valgrind. It will tell you when and where your kernel produces a segmentation fault.

Thanks for your tip, but would it still be OK if I use the emulation debugger mode of Developer Studio?

A segfault will clearly be indicated in the debug mode of Visual Studio and the program stops at that position. But you do not get the same kind of “fault log” (or history) that a Valgrind log would provide.
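Independent of the emulation route, it also helps to check the runtime error status after every launch; a small sketch (assuming the CUDA 2.2 runtime API, the macro name is made up):

#include <stdio.h>
#include <cuda_runtime.h>

//check both the launch itself and the kernel execution for errors
//(cudaThreadSynchronize is the CUDA 2.2-era synchronization call)
#define CHECK_LAUNCH(msg) do{\
	cudaError_t err=cudaGetLastError();\
	if(err==cudaSuccess)err=cudaThreadSynchronize();\
	if(err!=cudaSuccess)printf("%s failed: %s\n",msg,cudaGetErrorString(err));\
}while(0)

//usage after each launch:
//CR_reduction<<<grid,block>>>(...);
//CHECK_LAUNCH("CR_reduction");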

How do you invoke those kernels, and how do you declare and size the arrays that are passed into them?
For example, this code only works for an even NMAX, right? If NMAX is odd, the last even-indexed thread in CR_reduction reads element x+1 = NMAX, which is past the end of the arrays if they are sized to NMAX.
Also, the values inside the J array may influence the error.
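To be concrete, I would expect the reduction kernel to be driven by a host loop roughly along these lines (just a sketch with placeholder buffer names and block size; it also ignores the per-level storage and the J index arrays that the back-substitution needs). Is that more or less what you are doing?

int n=NMAX;
dim3 block(256);
while(n>1){
	dim3 grid((n+block.x-1)/block.x);
	CR_reduction<<<grid,block>>>(d_L,d_D,d_U,d_B,d_L2,d_D2,d_U2,d_B2,d_RESPR,n);
	cudaThreadSynchronize();//CUDA 2.2-era synchronization call
	//swap OLD/NEW pointers for the next reduction pass
	double *t;
	t=d_L;d_L=d_L2;d_L2=t;
	t=d_D;d_D=d_D2;d_D2=t;
	t=d_U;d_U=d_U2;d_U2=t;
	t=d_B;d_B=d_B2;d_B2=t;
	n=(n+1)/2;//roughly half of the equations survive each pass
}
//the substitution phase would then walk back up the levels with CR_substitution,
//doubling the number of solved unknowns at each step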