Transforming a C function into CUDA

Do you think the following translation from CPU code to CUDA is correct?

Somehow I am getting slightly different results.

The location of __syncthreads(); does not seem to make much difference…

What did I do wrong?

Any comments are welcome, and thanks in advance…

=================================================================

cuda version

{
int i,j,k,ns,nc,m;
REAL temp;

nc=kmax+2;
ns=nc*(jmax+2);

for (int slide = 0; slide <= (kmax+blockDim.z-1); slide += blockDim.z)
{

i = blockDim.x*blockIdx.x+threadIdx.x;
j = blockDim.y*blockIdx.y+threadIdx.y;
k = slide + threadIdx.z;

if ( i<=imax && j<=jmax && k<=kmax )
{
m = i*ns+j*nc+k ;
temp= (PHI[m]-PHI[m-ns])/dltx+(PHI[m+ns]-PHI[m])/dltx ;
PHI[m] = PHI[m]-delt*temp;
}
__syncthreads();

}

}

======================================================================
cpu version

for (i=0;i<=imax;i++)
  for (j=0;j<=jmax;j++)
    for (k=0;k<=kmax;k++)
    {
      m = i*ns+j*nc+k ;
      temp=(PHI[m]-PHI[m-ns])/dltx+(PHI[m+ns]-PHI[m])/dltx;
      PHI[m] = PHI[m]-delt*temp;
    }

Hi syoon,

I suppose your PHI is in global memory.
__syncthreads(); synchronizes only the threads within one thread block,
so your reads of PHI[m+ns], PHI[m] and PHI[m-ns] are not safe (the threads race): a thread in another block may already have changed the value.
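
To make the ordering problem concrete, here is a minimal sketch in plain C (so it runs without a GPU), reduced to one dimension because the stencil only couples neighbours along i. The function names `update_in_place` and `update_two_buffers` are hypothetical and not from the original code. The first function mirrors the CPU loop, where PHI[m-ns] has already been overwritten when PHI[m] is computed. The second reads only old values and writes to a separate array, which is one common way to make the result independent of the order in which threads run.

```c
#include <assert.h>
#include <stddef.h>

/* In-place update, as in the sequential CPU loop: by the time phi[i]
   is computed, phi[i-1] already holds its NEW value. */
void update_in_place(double *phi, size_t n, double delt, double dltx)
{
    assert(n >= 3);
    for (size_t i = 1; i + 1 < n; i++) {
        double temp = (phi[i] - phi[i - 1]) / dltx
                    + (phi[i + 1] - phi[i]) / dltx;
        phi[i] = phi[i] - delt * temp;
    }
}

/* Two-buffer update (hypothetical alternative): every read comes from
   phi_old and every write goes to phi_new, so no element is read after
   it has been modified and the evaluation order does not matter. */
void update_two_buffers(const double *phi_old, double *phi_new,
                        size_t n, double delt, double dltx)
{
    assert(n >= 3);
    phi_new[0] = phi_old[0];           /* boundary values are copied */
    phi_new[n - 1] = phi_old[n - 1];
    for (size_t i = 1; i + 1 < n; i++) {
        double temp = (phi_old[i] - phi_old[i - 1]) / dltx
                    + (phi_old[i + 1] - phi_old[i]) / dltx;
        phi_new[i] = phi_old[i] - delt * temp;
    }
}
```

The two functions generally give different numbers for the same input, so a parallel version cannot be expected to match the sequential in-place loop exactly. In a kernel, the two-buffer idea would presumably mean passing separate input and output pointers and swapping them between launches; that is an assumption about how the surrounding code could be organised, not something shown in the original post.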

Thanks for your reply…

Yes, it's in global memory.

I am fairly new to CUDA. I suspected PHI might have the problem you just described, but I was not sure…

Does this mean I need to move it to shared memory first? Is there another way than using shared memory, since the array may become large?

I will try, but could you please give me some advice?

Check first if your CPU version is good:

temp=(PHI[m]-PHI[m-ns])/dltx+(PHI[m+ns]-PHI[m])/dltx; <==> temp=(PHI[m+ns]-PHI[m-ns])/dltx;
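
Written out, the equivalence is just the two PHI[m] terms cancelling. Here \phi_m stands for PHI[m], n_s for ns and \Delta x for dltx:

```latex
\frac{\phi_m - \phi_{m-n_s}}{\Delta x} + \frac{\phi_{m+n_s} - \phi_m}{\Delta x}
  = \frac{\phi_{m+n_s} - \phi_{m-n_s}}{\Delta x}
```

This holds exactly in real arithmetic; in floating point the two forms can differ by rounding. If a central difference was intended, the usual form divides by 2\Delta x, but the post does not say which scheme is meant.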

Moreover with

m = i*ns+j*nc+k ;

nc=kmax+2;

ns=nc*(jmax+2);

when i=j=k=0, m=0, so PHI[m-ns] reads before the start of the array; that is out of bounds and you should expect a segmentation fault.
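
The index arithmetic can be checked with a small sketch in plain C. The helper names `flat_index` and `neighbours_in_bounds` are hypothetical, and the total array size of (imax+2)*(jmax+2)*(kmax+2) is an assumption inferred from the "+2" padding in nc and ns; the post does not state how PHI is allocated.

```c
#include <assert.h>

/* Flat index used in the post: m = i*ns + j*nc + k,
   with nc = kmax+2 and ns = nc*(jmax+2). */
long flat_index(long i, long j, long k, long jmax, long kmax)
{
    long nc = kmax + 2;
    long ns = nc * (jmax + 2);
    return i * ns + j * nc + k;
}

/* Returns 1 if both PHI[m-ns] and PHI[m+ns] lie inside the array,
   0 otherwise. ASSUMPTION: PHI holds (imax+2)*(jmax+2)*(kmax+2) values. */
int neighbours_in_bounds(long i, long j, long k,
                         long imax, long jmax, long kmax)
{
    assert(imax >= 0 && jmax >= 0 && kmax >= 0);
    long ns = (kmax + 2) * (jmax + 2);
    long total = (imax + 2) * ns;
    long m = flat_index(i, j, k, jmax, kmax);
    return (m - ns >= 0) && (m + ns < total);
}
```

For example, with imax=jmax=kmax=4 the strides are nc=6 and ns=36, so i=j=k=0 gives m=0 and m-ns=-36, which is outside the array, while i=j=k=1 gives m=43 with both neighbours (7 and 79) inside it.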

Well, you have an eagle's eye…

The index in the C version actually starts from 1…

So for the CUDA version I have

i = blockDim.x*blockIdx.x+threadIdx.x+1;

j = blockDim.y*blockIdx.y+threadIdx.y+1;

k = slide + threadIdx.z+1;

I didn't want to confuse people with the "+1".

The original code is much longer than what I posted; I cut it down to keep it simple…

I tried the following:

if ( i<=imax && j<=jmax && k<=kmax )
{
      m = i*ns+j*nc+k ;
      PHI_m = PHI[m];
      PHI_mms = PHI[m-ns];
      PHI_pms = PHI[m+ns];
      __syncthreads();
      CNVTPHI=(PHI_m-PHI_mms)/dltx+(PHI_pms-PHI_m)/dltx;
      PHI[m] = PHI_m-delt*CNVTPHI;
}
__syncthreads();

In theory this should be correct, I think(?)

Is there any problem with my reasoning?

Thanks in advance.