How to insert OpenACC multicore parallel loops into nvfortran GPU code

In my 3D ocean model, there is a sequential part that implicitly solves for the sea surface elevation.
In the GPU version (nvfortran -stdpar=gpu -acc=gpu -gpu=nomanaged) of the model, the CPU computes this part (DO I = 1, NPOI), as shown below. Here NPOI=43254. If the GPU computes this part instead, three things go wrong: a fully parallel loop makes the model fail to converge because of the data dependency; a version rewritten to run with 128 threads is slower than the sequential CPU code; and using more than 256 threads reports the error “Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution”.

!$ACC UPDATE HOST (APOI,RHSPOI,XPOI) !GPU->CPU
      
      DO ITERR=1,ITMAX
          
          ERR_NRM=0.
          X_NRM=0.

          DO I = 1, NPOI !Wish to have 32 to 512 threads here
              IDX_S=IAPOI(I)
              IDX_E=IAPOI(I+1)-1
              ERR_TMP=RHSPOI(I)
              DO K=IDX_S,IDX_E
                  J=JAPOI(K)
                  ERR_TMP=ERR_TMP-APOI(K)*XPOI(J)
              ENDDO
              ERR_NRM=ERR_NRM+ERR_TMP*ERR_TMP
              XPOI(I)=XPOI(I)+OMEGA*ERR_TMP
              X_NRM=X_NRM+XPOI(I)**2
          ENDDO
          
          RELERR=OMEGA*SQRT(ERR_NRM/X_NRM)
          IF (RELERR.LT.TOL)  THEN
!$ACC UPDATE DEVICE (XPOI) !CPU->GPU
              RETURN
          ENDIF
          
      ENDDO

But in the multicore version (nvfortran -stdpar=multicore) of the model, running this part in parallel with 32 threads, as shown below, works fine.

      CALL ACC_SET_NUM_CORES(32)
      
      DO ITERR=1,ITMAX
          
          ERR_NRM=0.
          X_NRM=0.

          DO concurrent(I=1:NPOI) REDUCE(+:ERR_NRM,X_NRM) 
     *LOCAL(IDX_S,IDX_E,ERR_TMP,K,J)
              IDX_S=IAPOI(I)
              IDX_E=IAPOI(I+1)-1
              ERR_TMP=RHSPOI(I)
              DO K=IDX_S,IDX_E
                  J=JAPOI(K)
                  ERR_TMP=ERR_TMP-APOI(K)*XPOI(J)
              ENDDO
              ERR_NRM=ERR_NRM+ERR_TMP*ERR_TMP
              XPOI(I)=XPOI(I)+OMEGA*ERR_TMP
              X_NRM=X_NRM+XPOI(I)**2
          ENDDO
          
          RELERR=OMEGA*SQRT(ERR_NRM/X_NRM)
          IF (RELERR.LT.TOL)  THEN
              RETURN
          ENDIF
          
      ENDDO      

So in the GPU version, I want this part to run as OpenACC multicore parallel code with 32 threads. Under “-stdpar=gpu -acc=gpu -gpu=nomanaged”, how do I achieve this?

Thanks!

While I haven’t tried this myself, given that DO CONCURRENT is built on top of OpenACC, you should be able to call “acc_set_device_type” with “acc_device_host” as the argument before the loop, then call it again afterwards with “acc_device_nvidia” to switch back to targeting the device.

You’ll also want to compile with “-stdpar -acc -target=gpu,multicore” or “-stdpar=gpu,multicore -acc=gpu,multicore”, depending on your compiler version. The “-target” flag is new. With these flags the compiler targets both NVIDIA GPUs and multicore CPUs.
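
As a rough, untested sketch of how the suggestion could look in the solver from the question: the device type is switched to the host just before the iteration loop and back to the GPU just after it. The subroutine name SOLVE_ON_HOST, the argument list, and the size NNZ are hypothetical and only there to make the fragment self-contained. It also assumes the arrays are already present on the device through an enclosing data region, as the UPDATE directives in the question imply. The loop body is copied unchanged from the question.

```fortran
      SUBROUTINE SOLVE_ON_HOST(NPOI,NNZ,ITMAX,OMEGA,TOL,
     *                         IAPOI,JAPOI,APOI,RHSPOI,XPOI)
! Hypothetical wrapper: name and argument list are assumptions.
      USE OPENACC
      IMPLICIT NONE
      INTEGER NPOI,NNZ,ITMAX
      REAL OMEGA,TOL
      INTEGER IAPOI(NPOI+1),JAPOI(NNZ)
      REAL APOI(NNZ),RHSPOI(NPOI),XPOI(NPOI)
      INTEGER ITERR,I,K,J,IDX_S,IDX_E
      REAL ERR_NRM,X_NRM,ERR_TMP,RELERR

! Copy device data to the host while the GPU is still the
! active device.
!$ACC UPDATE HOST (APOI,RHSPOI,XPOI)

! Route the following DO CONCURRENT loop to the multicore CPU.
      CALL ACC_SET_DEVICE_TYPE(ACC_DEVICE_HOST)
      CALL ACC_SET_NUM_CORES(32)

      DO ITERR=1,ITMAX
          ERR_NRM=0.
          X_NRM=0.
          DO CONCURRENT(I=1:NPOI) REDUCE(+:ERR_NRM,X_NRM)
     *LOCAL(IDX_S,IDX_E,ERR_TMP,K,J)
              IDX_S=IAPOI(I)
              IDX_E=IAPOI(I+1)-1
              ERR_TMP=RHSPOI(I)
              DO K=IDX_S,IDX_E
                  J=JAPOI(K)
                  ERR_TMP=ERR_TMP-APOI(K)*XPOI(J)
              ENDDO
              ERR_NRM=ERR_NRM+ERR_TMP*ERR_TMP
              XPOI(I)=XPOI(I)+OMEGA*ERR_TMP
              X_NRM=X_NRM+XPOI(I)**2
          ENDDO
          RELERR=OMEGA*SQRT(ERR_NRM/X_NRM)
! Leave the loop instead of returning, so the device is
! always reset below.
          IF (RELERR.LT.TOL) EXIT
      ENDDO

! Switch back to the GPU, then push the solution to the device.
      CALL ACC_SET_DEVICE_TYPE(ACC_DEVICE_NVIDIA)
!$ACC UPDATE DEVICE (XPOI)
      END SUBROUTINE SOLVE_ON_HOST
```

The sketch uses EXIT rather than the RETURN from the question so that the switch back to “acc_device_nvidia” and the host-to-device update run on every path out of the loop. Unlike the question's code, it also updates XPOI on the device when ITMAX is reached without convergence. It would be compiled with the flags mentioned above, for example “-stdpar -acc -target=gpu,multicore”.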

The “-stdpar -acc -target=gpu,multicore” flags work. Thanks!