I added OpenACC data management directives, and the result data are correct. The problem is that -O3 fails to speed up a "do concurrent (k=1:nt,j=1:ny,i=1:nx)" loop once -acc=gpu is added. In this case there is only one heavily loaded DC loop, shown at the bottom.
Under unified memory mode, our guess is that the NC interface does not check whether the data have been transferred back to the host (this may not be true), so we tried non-managed mode (-gpu=mem:separate), and it is slow.
Moreover, under unified memory with OpenACC enabled (-stdpar=gpu -acc=gpu), adding just one data transfer directive, "!$ACC UPDATE HOST (minM, EDH, edh_insty)", leads to the slowdown; it is at line 379 below.
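For the non-managed runs, the data management around the loop is roughly of the following shape (a simplified sketch; the exact directive placement and array lists in our code may differ):

!$acc enter data copyin(d2m, t2m, sst, msl, vel)
!$acc enter data create(minM, EDH, edh_insty)
!     ... the heavy do concurrent loop (shown below) runs here ...
!$acc update host(minM, EDH, edh_insty)
!$acc exit data delete(d2m, t2m, sst, msl, vel,
!$acc&               minM, EDH, edh_insty)

The loop itself, with its source line numbers: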
do concurrent (k=1:nt,j=1:ny,i=1:nx)
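!        inputs converted from K to degC and from Pa to hPa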
359 d2m0 = d2m(i,j,k) - 273.15
360 t2m0 = t2m(i,j,k) - 273.15
361 sst0 = sst(i,j,k) - 273.15
362 msl0 = msl(i,j,k)/100
363 vel0 = vel(i,j,k)
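!        relative humidity (%) via the Magnus formula: rh = 100*es(d2m0)/es(t2m0)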
367 rh=10.**(7.5* (d2m0/(237.3+d2m0)-t2m0/(237.3+t2m0))+2.)
368 z_r = 2.
373 call GetE(sst(i,j,k),t2m(i,j,k),msl0,rh,z_r,
374 * vel0, 0.00001, ME, EDH0, edh_insty0)
375 minM(i,j,k)=ME
376 EDH(i,j,k)=EDH0
377 edh_insty(i,j,k)=edh_insty0
378 enddo
379 !$ACC UPDATE HOST (minM, EDH, edh_insty)
380
In conclusion, the GPU computation times under nvfortran 25.3 are:
16s : -stdpar=gpu
2.3s : -stdpar=gpu -O3
16s : -stdpar=gpu -O3 -acc=gpu -gpu=mem:separate
16s : -stdpar=gpu -O3 -acc=gpu
2.3s : -O3 -acc=gpu (DC loop rewritten in OpenACC form; see the sketch below)
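
For reference, the rewritten loop in the last case looks roughly like this (a simplified sketch; the private list is inferred from the loop body above, and it assumes GetE is inlined or marked with "!$acc routine seq"):

!$acc parallel loop collapse(3) private(d2m0, t2m0, sst0, msl0,
!$acc&   vel0, rh, z_r, ME, EDH0, edh_insty0)
      do k = 1, nt
      do j = 1, ny
      do i = 1, nx
         d2m0 = d2m(i,j,k) - 273.15
         t2m0 = t2m(i,j,k) - 273.15
         sst0 = sst(i,j,k) - 273.15
         msl0 = msl(i,j,k)/100
         vel0 = vel(i,j,k)
         rh = 10.**(7.5*(d2m0/(237.3+d2m0)-t2m0/(237.3+t2m0))+2.)
         z_r = 2.
         call GetE(sst(i,j,k), t2m(i,j,k), msl0, rh, z_r,
     *             vel0, 0.00001, ME, EDH0, edh_insty0)
         minM(i,j,k) = ME
         EDH(i,j,k) = EDH0
         edh_insty(i,j,k) = edh_insty0
      enddo
      enddo
      enddo

Here collapse(3) mirrors the three DC indices, and the scalars written inside the body are listed as private to avoid races between iterations.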
Thanks!