In my 3D ocean model, there is a sequential part that implicitly solves for the sea surface elevation.
In the GPU version of the model (nvfortran -stdpar=gpu -acc=gpu -gpu=nomanaged), the CPU computes this part (the DO I = 1, NPOI loop), as shown below. Here NPOI=43254. If the GPU computes this part instead, I see three problems: a fully parallel loop makes the model fail to converge because of the data dependency on XPOI; a version rewritten to use 128 parallel threads is slower than the sequential CPU code; and using more than 256 threads reports the error “Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution”.
!$ACC UPDATE HOST (APOI,RHSPOI,XPOI) !GPU->CPU
      DO ITERR=1,ITMAX
        ERR_NRM=0.
        X_NRM=0.
        DO I = 1, NPOI !Wish to have 32 to 512 threads here
          IDX_S=IAPOI(I)
          IDX_E=IAPOI(I+1)-1
          ERR_TMP=RHSPOI(I)
          DO K=IDX_S,IDX_E
            J=JAPOI(K)
            ERR_TMP=ERR_TMP-APOI(K)*XPOI(J)
          ENDDO
          ERR_NRM=ERR_NRM+ERR_TMP*ERR_TMP
          XPOI(I)=XPOI(I)+OMEGA*ERR_TMP
          X_NRM=X_NRM+XPOI(I)**2
        ENDDO
        RELERR=OMEGA*SQRT(ERR_NRM/X_NRM)
        IF (RELERR.LT.TOL) THEN
!$ACC UPDATE DEVICE (XPOI) !CPU->GPU
          RETURN
        ENDIF
      ENDDO
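In case it helps anyone reproduce the behaviour, here is a minimal standalone sketch of the same sequential sweep. It is free-form Fortran and reuses the array names from my model, but the matrix is made up: a small tridiagonal system stored in CSR form, with the diagonal scaled to 1 (which is what the update XPOI(I)=XPOI(I)+OMEGA*ERR_TMP appears to assume). The values of NPOI, OMEGA, TOL and ITMAX here are placeholders, not the ones used in the model.

```fortran
! Standalone sketch of the sequential sweep on a hypothetical CSR matrix.
! Assumption: every row is scaled so that its diagonal entry equals 1.
program sor_csr
  implicit none
  integer, parameter :: npoi = 8                 ! placeholder size (model uses 43254)
  integer, parameter :: nnz = 3*npoi - 2         ! tridiagonal: 3 entries per row, 2 at the ends
  integer, parameter :: itmax = 200              ! placeholder iteration limit
  real, parameter :: omega = 1.0, tol = 1.0e-5   ! placeholder relaxation factor and tolerance
  integer :: iapoi(npoi+1), japoi(nnz)
  real :: apoi(nnz), rhspoi(npoi), xpoi(npoi)
  real :: err_tmp, err_nrm, x_nrm, relerr
  integer :: i, k, iterr

  ! Build the CSR arrays row by row: -0.25, 1.0, -0.25 on the three diagonals.
  k = 0
  do i = 1, npoi
    iapoi(i) = k + 1
    if (i > 1) then
      k = k + 1
      japoi(k) = i - 1
      apoi(k) = -0.25
    end if
    k = k + 1
    japoi(k) = i
    apoi(k) = 1.0
    if (i < npoi) then
      k = k + 1
      japoi(k) = i + 1
      apoi(k) = -0.25
    end if
    ! Right-hand side = row sum, so the exact solution is all ones.
    rhspoi(i) = sum(apoi(iapoi(i):k))
  end do
  iapoi(npoi+1) = k + 1

  xpoi = 0.0
  relerr = huge(1.0)
  do iterr = 1, itmax
    err_nrm = 0.0
    x_nrm = 0.0
    ! Row I reads XPOI(J) values that rows 1..I-1 updated in this same sweep;
    ! this is the loop-carried dependency that a parallel version changes.
    do i = 1, npoi
      err_tmp = rhspoi(i)
      do k = iapoi(i), iapoi(i+1) - 1
        err_tmp = err_tmp - apoi(k)*xpoi(japoi(k))
      end do
      err_nrm = err_nrm + err_tmp*err_tmp
      xpoi(i) = xpoi(i) + omega*err_tmp
      x_nrm = x_nrm + xpoi(i)**2
    end do
    relerr = omega*sqrt(err_nrm/x_nrm)
    if (relerr < tol) exit
  end do

  ! Self-check: stop with an error if the sweep did not reach the known solution.
  if (relerr >= tol .or. maxval(abs(xpoi - 1.0)) > 1.0e-3) error stop 'not converged'
  print *, 'converged, relerr =', relerr
end program sor_csr
```

With any Fortran compiler this should build as a single file, for example nvfortran sor_csr.f90 (the file name is just an example). Replacing the inner I loop with a parallel construct turns the sweep into something closer to a Jacobi-style update, which is where the change in convergence behaviour comes from.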
In the multicore version of the model (nvfortran -stdpar=multicore), however, running this part in parallel with 32 threads, as shown below, works fine.
      CALL ACC_SET_NUM_CORES(32)
      DO ITERR=1,ITMAX
        ERR_NRM=0.
        X_NRM=0.
        DO concurrent(I=1:NPOI) REDUCE(+:ERR_NRM,X_NRM)
     *    LOCAL(IDX_S,IDX_E,ERR_TMP,K,J)
          IDX_S=IAPOI(I)
          IDX_E=IAPOI(I+1)-1
          ERR_TMP=RHSPOI(I)
          DO K=IDX_S,IDX_E
            J=JAPOI(K)
            ERR_TMP=ERR_TMP-APOI(K)*XPOI(J)
          ENDDO
          ERR_NRM=ERR_NRM+ERR_TMP*ERR_TMP
          XPOI(I)=XPOI(I)+OMEGA*ERR_TMP
          X_NRM=X_NRM+XPOI(I)**2
        ENDDO
        RELERR=OMEGA*SQRT(ERR_NRM/X_NRM)
        IF (RELERR.LT.TOL) THEN
          RETURN
        ENDIF
      ENDDO
So in the GPU version, I want this part to run as OpenACC multicore parallel code with 32 threads on the CPU. How do I achieve this under “-stdpar=gpu -acc=gpu -gpu=nomanaged”?
Thanks!