Hi guys, I am programming with CUDA a CFD software. At a certain level I need to solve a tri-diagonal system of equations. To do that, I decided to code two different options:either to let the CPU solve it by transferring data from the GPU to the CPU using a standard serial algorithm such as the Thomas algorithm, or let the GPU try and solve it by using the cyclic reduction algorithm. After doing some preliminary debugging on the cyclic reduction kernels side, I still have problems with the GPU solution. The kernels crash most of the times, they typically tend to give the correct results at the first iterations then, even for say steady conditions of the results, errors start to propagate and the whole thing blows up. I initially thought that the problem could be traced back to the floating point accuracy. I also compiled with sm_13 enabling double precision, I only obtained some slight improvements but nothing much. I am using the Nvidia TESLA C1060 and I compile using the CUDA SDK 2.2 toolkit. Below I put the two relevant kernels of the cyclic reduction algorithm (reduction and substitution). If anyone could help me out and shed some light I would be extremely grateful…

**global** void CR_reduction(double *LOLD,double *DOLD,double *UOLD,double *BOLD,double *LNEW,double *DNEW,double *UNEW,double *BNEW,float *RESPR,int NMAX){

```
int x = threadIdx.x + blockIdx.x * blockDim.x;
//Remember : NMAX refers to the previous reduction phase
if(x>=NMAX)return;
if(!x){
//FIRST EQUATION OF REDUCTION PHASE
LNEW[x]=0.0;//west coefficient is nil
DNEW[x]=DOLD[x]-UOLD[x]*(LOLD[x+1]/DOLD[x+1]);
UNEW[x]=-UOLD[x]*(UOLD[x+1]/DOLD[x+1]);
BNEW[x]=BOLD[x]-UOLD[x]*(BOLD[x+1]/DOLD[x+1]);
if(RESPR[x]<RESPR[x+1])RESPR[x]=RESPR[x+1];
}
else if((x & (2-1))==0 && x){
if((x+1)==(NMAX-1)){
LNEW[x/2]=-LOLD[x]*(LOLD[x-1]/DOLD[x-1]);
DNEW[x/2]=DOLD[x]-(LOLD[x]*(UOLD[x-1]/DOLD[x-1]))-(UOLD[x]*(LOLD[x+1]/DOLD[x+1]));
UNEW[x/2]=0.0;
BNEW[x/2]=BOLD[x]-(LOLD[x]*(BOLD[x-1]/DOLD[x-1]))-(UOLD[x]*(BOLD[x+1]/DOLD[x+1]));
if(RESPR[x]<RESPR[x+1])RESPR[x/2]=RESPR[x+1];
else RESPR[x/2]=RESPR[x];
}
else{
LNEW[x/2]=-LOLD[x]*(LOLD[x-1]/DOLD[x-1]);
DNEW[x/2]=DOLD[x]-(LOLD[x]*(UOLD[x-1]/DOLD[x-1]))-(UOLD[x]*(LOLD[x+1]/DOLD[x+1]));
UNEW[x/2]=-UOLD[x]*(UOLD[x+1]/DOLD[x+1]);
BNEW[x/2]=BOLD[x]-(LOLD[x]*(BOLD[x-1]/DOLD[x-1]))-(UOLD[x]*(BOLD[x+1]/DOLD[x+1]));
if(RESPR[x]<RESPR[x+1])RESPR[x/2]=RESPR[x+1];
else RESPR[x/2]=RESPR[x];
}
}
```

}

**global** void CR_substitution(double *L,double *D,double *U,double *B,int *J,int NMAX,float *X){

//L->lower diagonal

//D->main diagonal

//U->upper diagonal

//B->source term

//J->index of array

//NMAX->number of unknowns for given backsubstitution phase

//X->solution vector

```
int x = threadIdx.x + blockIdx.x * blockDim.x;
//Remember : NMAX refers to the previous reduction phase
if(x>=NMAX)return;
if(!x && X[J[x]]==0.0){
//first node where equation is solved
X[J[x]]=B[x]/D[x];
}
else if(x && (x!=(NMAX-1)) && X[J[x]]==0.0){
X[J[x]]=(B[x]/D[x])-(U[x]*(X[J[x+1]]/D[x]))-(L[x]*(X[J[x-1]]/D[x]));
}
else if(((x==(NMAX-1))) && X[J[x]]==0.0){
X[J[x]]=(B[x]/D[x])-(L[x]*(X[J[x-1]]/D[x]));
}
```

}