syoon
February 10, 2011, 11:50pm
1
Do you think the following transformation from the CPU code into CUDA is correct?
Somehow I am getting a slightly different outcome.
The location of __syncthreads() is not a big factor…
What did I do wrong?
Any comments are welcome, and thanks in advance…
=================================================================
cuda version
{
    int i, j, k, ns, nc, m;
    REAL temp;
    nc = kmax + 2;
    ns = nc * (jmax + 2);
    for (int slide = 0; slide <= (kmax + blockDim.z - 1); slide += blockDim.z)
    {
        i = blockDim.x * blockIdx.x + threadIdx.x;
        j = blockDim.y * blockIdx.y + threadIdx.y;
        k = slide + threadIdx.z;
        if (i <= imax && j <= jmax && k <= kmax)
        {
            m = i * ns + j * nc + k;
            temp = (PHI[m] - PHI[m-ns]) / dltx + (PHI[m+ns] - PHI[m]) / dltx;
            PHI[m] = PHI[m] - delt * temp;
        }
        __syncthreads();
    }
}
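(For reference, the kernel is launched with something like the sketch below; the block shape, the kernel name my_kernel, and the argument list are only placeholders here, not my actual values. grid.z is 1 because the slide loop sweeps k inside the kernel.)

dim3 block(8, 8, 8);                    // 512 threads per block (arbitrary example)
dim3 grid((imax + block.x) / block.x,   // enough blocks in x so every i <= imax is reached
          (jmax + block.y) / block.y,   // likewise for j <= jmax
          1);                           // k is swept by the slide loop inside the kernel
my_kernel<<<grid, block>>>(PHI, imax, jmax, kmax, dltx, delt);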
======================================================================
cpu version
for (i = 0; i <= imax; i++)
    for (j = 0; j <= jmax; j++)
        for (k = 0; k <= kmax; k++)
        {
            m = i * ns + j * nc + k;
            temp = (PHI[m] - PHI[m-ns]) / dltx + (PHI[m+ns] - PHI[m]) / dltx;
            PHI[m] = PHI[m] - delt * temp;
        }
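(For context, PHI is a flattened 3-D array; with these strides, element (i, j, k) lives at PHI[i*ns + j*nc + k]. The allocation is along these lines; the imax+2 extent in i is shown only to illustrate the layout:)

int nc = kmax + 2;                                        /* distance between consecutive j */
int ns = nc * (jmax + 2);                                 /* distance between consecutive i */
REAL *PHI = (REAL *)malloc((size_t)(imax + 2) * ns * sizeof(REAL));
/* element (i, j, k) is PHI[i*ns + j*nc + k] */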
varslan
February 11, 2011, 1:44pm
2
Hi syoon,
I suppose your PHI is in global memory.
__syncthreads() synchronizes only the threads within one thread block,
so your accesses PHI[m+ns], PHI[m] and PHI[m-ns] are not safe (threads race): some other thread may already have changed the value before you read it.
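One way to avoid the race without shared memory is to read from one array and write to another, so no thread ever reads an element that another thread may already have overwritten. A rough sketch only; the names PHI_old / PHI_new, the kernel name update_phi, and taking k from blockIdx.z instead of your slide loop are just for illustration, adapt them to your code:

__global__ void update_phi(const REAL *PHI_old, REAL *PHI_new,
                           int imax, int jmax, int kmax,
                           REAL dltx, REAL delt)
{
    int nc = kmax + 2;
    int ns = nc * (jmax + 2);
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    int k = blockDim.z * blockIdx.z + threadIdx.z;
    if (i <= imax && j <= jmax && k <= kmax)
    {
        int m = i * ns + j * nc + k;
        // all reads hit PHI_old, which no thread writes, so there is no race
        // and no __syncthreads() is needed; each thread writes a distinct element
        REAL temp = (PHI_old[m] - PHI_old[m-ns]) / dltx
                  + (PHI_old[m+ns] - PHI_old[m]) / dltx;
        PHI_new[m] = PHI_old[m] - delt * temp;
        // note: PHI_old[m-ns] must stay inside the array for the smallest i you use
    }
}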
syoon
February 11, 2011, 5:37pm
3
Thanks for your reply…
Yes, it is in global memory.
I am kind of new to CUDA and I suspected PHI might have the problem you just described, but I was not sure…
Does this mean I need to move it into shared memory first? Is there another way besides shared memory, since the array may become too big for it?
I will try, but could you please give me some advice?
varslan
February 11, 2011, 7:33pm
4
======================================================================
cpu version
for (i = 0; i <= imax; i++)
    for (j = 0; j <= jmax; j++)
        for (k = 0; k <= kmax; k++)
        {
            m = i * ns + j * nc + k;
            temp = (PHI[m] - PHI[m-ns]) / dltx + (PHI[m+ns] - PHI[m]) / dltx;
            PHI[m] = PHI[m] - delt * temp;
        }
First check whether your CPU version is correct:
temp = (PHI[m] - PHI[m-ns]) / dltx + (PHI[m+ns] - PHI[m]) / dltx;  <==>  temp = (PHI[m+ns] - PHI[m-ns]) / dltx;
Moreover, with
m = i*ns + j*nc + k;
nc = kmax + 2;
ns = nc * (jmax + 2);
when i = j = k = 0 => m = 0, and you should get a segmentation fault (out-of-bounds read) on PHI[m-ns].
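To make that concrete with small numbers (chosen just for illustration): with kmax = 2 and jmax = 2 you get nc = 4 and ns = 16, so the very first iteration (i = j = k = 0) gives m = 0 and PHI[m-ns] = PHI[-16], a read before the start of the array.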
syoon
February 14, 2011, 8:18pm
5
Well, you have an eagle's eye…
The indices in the C version start from 1…
so for the CUDA version I actually have
i = blockDim.x*blockIdx.x+threadIdx.x+1;
j = blockDim.y*blockIdx.y+threadIdx.y+1;
k = slide + threadIdx.z+1;
I didn't want to confuse people with the "+1"; the original code is much longer than what I posted, so I simplified it…
I will try the following:
if (i <= imax && j <= jmax && k <= kmax)
{
    m = i * ns + j * nc + k;
    PHI_m   = PHI[m];
    PHI_mms = PHI[m-ns];
    PHI_pms = PHI[m+ns];
    __syncthreads();
    CNVTPHI = (PHI_m - PHI_mms) / dltx + (PHI_pms - PHI_m) / dltx;
    PHI[m] = PHI_m - delt * CNVTPHI;
}
__syncthreads();
Theoretically I think I did this correctly(?).
Any problems with my thinking?
Thanks in advance.
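One thing I am still not sure about: since __syncthreads() only covers one block, another block could still update PHI[m+ns] or PHI[m-ns] before my block reads them, and the __syncthreads() inside the if() could be a problem when some threads of a block fail the bounds test. If so, I could fall back to two buffers swapped between kernel launches (like the update_phi sketch above), where the launch boundary acts as the global sync. A rough sketch of what I have in mind; the buffer names, PHI_host, and nsteps are made up:

dim3 block(8, 8, 8);
dim3 grid((imax + block.x) / block.x,
          (jmax + block.y) / block.y,
          (kmax + block.z) / block.z);

REAL *PHI_d_old, *PHI_d_new;
size_t bytes = (size_t)(imax + 2) * (jmax + 2) * (kmax + 2) * sizeof(REAL);
cudaMalloc((void **)&PHI_d_old, bytes);
cudaMalloc((void **)&PHI_d_new, bytes);
cudaMemcpy(PHI_d_old, PHI_host, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(PHI_d_new, PHI_host, bytes, cudaMemcpyHostToDevice);  // so boundary cells start valid

for (int step = 0; step < nsteps; ++step)
{
    update_phi<<<grid, block>>>(PHI_d_old, PHI_d_new, imax, jmax, kmax, dltx, delt);
    REAL *tmp = PHI_d_old;   // swap roles: last output becomes next input
    PHI_d_old = PHI_d_new;
    PHI_d_new = tmp;
}
// after the final swap, the newest result is in PHI_d_old
cudaMemcpy(PHI_host, PHI_d_old, bytes, cudaMemcpyDeviceToHost);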