Hello,
I am new to GPU programming and am trying to port my code to the GPU. I started by porting one function of my code with OpenACC, and I get wrong results.
Before entering this function, I move the data explicitly to the device, and afterwards I copy some arrays back to the host:
#pragma acc enter data copyin(sifq[0:2*nfc],sxq[0:nx*nq],sq[0:nv*nq],sdxdx[0:nx*nx*nq],sdqdx[0:nx*nv*nq], \
saux[0:naux*nq],srhs[0:nv*nq],sxc[0:nx*nfc],swnc[0:(nx+1)*nfc],swxdc[0:nfc], \
sauxf[0:nauxf*nfc])
obj->compute( ics,ice, sifq, sxq,sq,sdxdx,sdqdx,saux,srhs,
sxc,swnc,swxdc,sauxf, nfc, nq );
#pragma acc exit data copyout (srhs[0:nv*nq],sauxf[0:nauxf*nfc])
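The data movement above, reduced to a self-contained sketch, looks roughly like this. The names `compute_stub` and `run` are hypothetical stand-ins for `obj->compute` and its caller, and the array contents are made up. The sketch also releases the input-only array with a `delete` clause, since arrays that appear in `enter data` but in no `exit data` directive stay allocated on the device.

```cpp
#include <cassert>

typedef double Real;
typedef int Int;

// Hypothetical stand-in for obj->compute(): writes 2*in[i] into out[i].
// The arrays are expected to be on the device already (present clause).
void compute_stub(const Real *in, Real *out, Int n)
{
    assert(n >= 0);
    #pragma acc parallel loop present(in[0:n], out[0:n])
    for (Int i = 0; i < n; i++)
    {
        out[i] = 2.0 * in[i];
    }
}

// Hypothetical caller: unstructured data region around the kernel call.
void run(const Real *in, Real *out, Int n)
{
    // Allocate both arrays on the device and upload their host contents.
    #pragma acc enter data copyin(in[0:n], out[0:n])

    compute_stub(in, out, n);

    // Download the result and release the input-only array.
    #pragma acc exit data copyout(out[0:n]) delete(in[0:n])
}
```

Built without OpenACC support, the pragmas are ignored and the loop runs serially on the host, which gives a reference result to compare against.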
In this “compute” function:
void cClass::compute( Int ics,Int ice, Int *sicq, Real *sxq, Real *sq, Real *sdxdx, Real *sdqdx, Real *saux, Real *srhs,
Real *sxc, Real *swc, Real *swxdc, Real *sauxc, Int nfc, Int nq )
{
Real dql[MxNVs],dqr[MxNVs];
Real dql0[MxNVs],dqr0[MxNVs];
Real auxl[MxNVs],auxr[MxNVs];
Int ix,iql,iqr,ia;
Real xn[3],wn[3];
Real al,unl,rl,ll1,ll3,ll4,pl,hl,fl[MxNVs],ql[MxNVs],dhl,dal;
Real ar,unr,rr,lr1,lr3,lr4,pr,hr,fr[MxNVs],qr[MxNVs],dhr,dar;
Real le1,le3,le4;
Real aa,a2a,ra,ha,ka,qa[MxNVs],ana[3],una,unaa,raa, la1,la4,la3,lmax,fa[MxNVs];
Real dw1,dw3,dw4,dw2[3],dw2a,dr,dq[MxNVs],dun,dp,dpa;
Real dur[3],dunr,dpr,drr,dtr,tr,dt;
Real dwl[MxNVs],dwr[MxNVs];
Real mr,ml,wl,wr,cp;
Real f[MxNVs], dw5[MxNVs];
cp= some_value;
#pragma acc parallel loop \
private(dql0,dqr0,\
dql,dqr,\
auxl,auxr,\
ix,iql,iqr,ia,\
xn,wn,\
al,unl,rl,ll1,ll3,ll4,pl,hl,fl,ql,dhl,dal, \
ar,unr,rr,lr1,lr3,lr4,pr,hr,fr,qr,dhr,dar, \
le1,le3,le4, \
aa,a2a,ra,ha,ka,qa,ana,una,unaa,raa, la1,la4,la3,lmax,fa, \
dw1,dw3,dw4,dw2,dw2a,dr,dq,dun,dp,dpa, \
dur,dunr,dpr,drr,dtr,tr,dt, \
dwl,dwr, \
mr,ml,wl,wr,cp, \
f, dw5 ) \
present(sq[0:nv*nq],sauxc[0:nauxf*nfc], saux[0:naux*nq], \
sxq[0:nx*nq],srhs[0:nv*nq],sdqdx[0:nv*nx*nq],sicq[0:2*nfc], \
sxc[0:nx*nfc],swc[0:(nx+1)*nfc],swxdc[0:nfc],sdxdx[0:nx*nx*nq])
for( Int ic=ics;ic<ice;ic++ )
{
//some computations
.....
//scatter results to srhs
#pragma acc atomic
srhs[ADDR_(0,iql,nq)]-= f[0];
.....
#pragma acc atomic
srhs[ADDR_(0,iqr,nq)]+= f[0];
.....
}
}
In this function, “Real” is “double”, “Int” is “int”, and “ADDR_” is a function-like macro.
Since I am only modifying this function, any error in the final results of the whole code can only come from this function. When I compile the code for the CPU with “-acc=multicore”, I get the correct results. When I compile it for the GPU, the results are wrong, but they do not change from run to run, so I think the “atomic” updates are working.
I am using a V100 card and my hpc_sdk version is 22.2. I compile the code with the following options: -fast -acc -Minfo=accel -gpu=cc70,nordc $(DEFS) -fPIC -Wall
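To isolate the scatter step, the pattern can be reduced to a small self-contained sketch like the one below. Everything in it is an assumption made for illustration: the function name `scatter`, the layout behind `ADDR_`, the flux formula, and the value of `cp`. One detail may matter for the comparison: in OpenACC, a variable listed in a `private` clause gets an uninitialized copy inside the loop, so a scalar assigned on the host before the loop does not carry its value in. The sketch therefore passes `cp` with `firstprivate` and keeps `private` for the scratch array only.

```cpp
#include <cassert>

typedef double Real;
typedef int Int;

// Hypothetical stand-in for the ADDR_ macro (variable-major layout is an assumption).
#define ADDR_(iv, iq, nq) ((iv) * (nq) + (iq))

// Hypothetical reduced kernel: one flux per face, scattered to the two
// neighbouring cells with atomic updates.
void scatter(Int ics, Int ice, const Int *sicq, const Real *sq,
             Real *srhs, Int nfc, Int nq)
{
    assert(ics >= 0 && ice <= nfc);

    Real cp = 2.0;  // placeholder value, assigned on the host before the loop
    Real f[1];      // per-iteration scratch, overwritten before it is read

    // firstprivate(cp): each private copy starts from the host value.
    // private(f): each private copy starts uninitialized.
    #pragma acc parallel loop private(f) firstprivate(cp) \
        copyin(sicq[0:2*nfc], sq[0:nq]) copy(srhs[0:nq])
    for (Int ic = ics; ic < ice; ic++)
    {
        Int iql = sicq[ADDR_(0, ic, nfc)];  // left cell of face ic
        Int iqr = sicq[ADDR_(1, ic, nfc)];  // right cell of face ic

        f[0] = cp * (sq[iqr] - sq[iql]);    // made-up flux

        // Several faces can touch the same cell, hence the atomics.
        #pragma acc atomic update
        srhs[ADDR_(0, iql, nq)] -= f[0];
        #pragma acc atomic update
        srhs[ADDR_(0, iqr, nq)] += f[0];
    }
}
```

Because every face subtracts from one cell what it adds to the other, the sum of `srhs` should stay unchanged, which is a cheap check to run on both the multicore and the GPU build.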
Thanks for your help in advance,
Feng