How to insert OpenACC multicore parallel loops into nvfortran GPU code

chenbr · March 22, 2025, 5:29am

In my 3D ocean model, there is a sequential part to implicitly solve the sea surface elevation.
In the GPU version of the model (nvfortran -stdpar=gpu -acc=gpu -gpu=nomanaged), the CPU computes this part (DO I = 1, NPOI), as shown below. Here NPOI=43254. If the GPU computes this part instead, a fully parallel loop makes the model fail to converge because of the data dependency, a version rewritten to run with 128 parallel threads is slower than the sequential CPU code, and more than 256 threads reports the error “Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution”.

!$ACC UPDATE HOST (APOI,RHSPOI,XPOI) !GPU->CPU
      
      DO ITERR=1,ITMAX
          
          ERR_NRM=0.
          X_NRM=0.

          DO I = 1, NPOI !Wish to have 32 to 512 threads here
              IDX_S=IAPOI(I)
              IDX_E=IAPOI(I+1)-1
              ERR_TMP=RHSPOI(I)
              DO K=IDX_S,IDX_E
                  J=JAPOI(K)
                  ERR_TMP=ERR_TMP-APOI(K)*XPOI(J)
              ENDDO
              ERR_NRM=ERR_NRM+ERR_TMP*ERR_TMP
              XPOI(I)=XPOI(I)+OMEGA*ERR_TMP
              X_NRM=X_NRM+XPOI(I)**2
          ENDDO
          
          RELERR=OMEGA*SQRT(ERR_NRM/X_NRM)
          IF (RELERR.LT.TOL)  THEN
!$ACC UPDATE DEVICE (XPOI) !CPU->GPU
              RETURN
          ENDIF
          
      ENDDO

But in the multicore version of the model (nvfortran -stdpar=multicore), running this part in parallel with 32 threads, as shown below, works fine.

      CALL ACC_SET_NUM_CORES(32)
      
      DO ITERR=1,ITMAX
          
          ERR_NRM=0.
          X_NRM=0.

          DO concurrent(I=1:NPOI) REDUCE(+:ERR_NRM,X_NRM) 
     *LOCAL(IDX_S,IDX_E,ERR_TMP,K,J)
              IDX_S=IAPOI(I)
              IDX_E=IAPOI(I+1)-1
              ERR_TMP=RHSPOI(I)
              DO K=IDX_S,IDX_E
                  J=JAPOI(K)
                  ERR_TMP=ERR_TMP-APOI(K)*XPOI(J)
              ENDDO
              ERR_NRM=ERR_NRM+ERR_TMP*ERR_TMP
              XPOI(I)=XPOI(I)+OMEGA*ERR_TMP
              X_NRM=X_NRM+XPOI(I)**2
          ENDDO
          
          RELERR=OMEGA*SQRT(ERR_NRM/X_NRM)
          IF (RELERR.LT.TOL)  THEN
              RETURN
          ENDIF
          
      ENDDO

So in the GPU version, I want this part to run as an OpenACC multicore parallel loop with 32 threads. How do I achieve this under “-stdpar=gpu -acc=gpu -gpu=nomanaged”?

Thanks!

MatColgrove · March 24, 2025, 4:31pm

While I haven’t tried this myself, given that DO CONCURRENT is built on top of OpenACC, you should be able to call “acc_set_device_type” with “acc_device_host” as the argument before the loop, then call it again afterwards with “acc_device_nvidia” to switch back to targeting the device.

You’ll also want to compile with “-stdpar -acc -target=gpu,multicore” or “-stdpar=gpu,multicore -acc=gpu,multicore”, depending on your compiler version. The “-target” flag is new. This will have the compiler target both NVIDIA GPUs and multicore CPUs.
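As a rough illustration of the switch described above, here is a minimal, untested sketch. The program name, array, and problem size are made up for the example; only “acc_set_device_type”, “acc_device_host”, “acc_device_nvidia”, and the compile flags come from this thread.

```fortran
! Minimal sketch (untested): run one DO CONCURRENT loop on the multicore
! host inside a program that is otherwise built for the GPU.
! Assumed build line, using the flags suggested above:
!   nvfortran -stdpar -acc -target=gpu,multicore switch_demo.f90
program switch_demo
  use openacc   ! acc_set_device_type, acc_device_host, acc_device_nvidia
  implicit none
  integer, parameter :: n = 1000   ! hypothetical problem size
  real :: x(n), total
  integer :: i

  x = 1.0
  total = 0.0

  ! Make the host (multicore CPU) the current device for the loop below.
  call acc_set_device_type(acc_device_host)

  do concurrent (i = 1:n) reduce(+:total)
    total = total + x(i)
  end do

  ! Switch back so that later parallel regions target the GPU again.
  call acc_set_device_type(acc_device_nvidia)

  print *, 'sum =', total
end program switch_demo
```

The thread count on the host side can presumably still be set with ACC_SET_NUM_CORES, as in the multicore-only build shown earlier.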

chenbr · March 27, 2025, 8:04am

The “-stdpar -acc -target=gpu,multicore” flags work. Thanks!

chenbr · April 9, 2025, 4:38am

Today I found that the model results produced with this method are wrong. XPOI(:) seems to become zero. The code looks like this:

!$ACC UPDATE HOST (APOI,RHSPOI,XPOI) !GPU to CPU

      call acc_set_device_type(acc_device_host)

      DO ITERR = 1, ITMAX

          ERR_NRM = 0.0
          X_NRM = 0.0 

! OpenACC Multicore:
          DO concurrent(I=1:NPOI) REDUCE(+:ERR_NRM,X_NRM) 
     *LOCAL(IDX_S,IDX_E,ERR_TMP,K,J)
            IDX_S = IAPOI(I)
            IDX_E = IAPOI(I + 1) - 1
            ERR_TMP = RHSPOI(I)
            DO K = IDX_S, IDX_E
                J = JAPOI(K)
                ERR_TMP = ERR_TMP - APOI(K) * XPOI(J)
            END DO
            ERR_NRM = ERR_NRM + ERR_TMP * ERR_TMP
            XPOI(I) = XPOI(I) + OMEGA * ERR_TMP
            X_NRM = X_NRM + XPOI(I) ** 2 
          END DO

          RELERR = OMEGA * SQRT(ERR_NRM / X_NRM)
          IF (RELERR .LT. TOL) THEN
!$ACC UPDATE DEVICE (XPOI) !CPU to GPU
          call acc_set_device_type(acc_device_nvidia)
            RETURN
          END IF

      ENDDO

The compile command is “nvfortran -Mpreprocess -stdpar -acc -target=gpu,multicore -gpu=nomanaged”.

Is there anything not set correctly?

Thanks!

chenbr · April 9, 2025, 5:47am

I solved it myself. I moved “!$ACC UPDATE DEVICE (XPOI) !CPU to GPU” to after “call acc_set_device_type(acc_device_nvidia)”, and now the results are correct. So I guess that after “call acc_set_device_type(acc_device_host)” the device is the host. :-)

MatColgrove · April 9, 2025, 4:20pm

Correct. When the target device is the host, the update becomes a no-op, since the host would be updating itself.
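The ordering discussed above can be sketched as follows. This is an untested, stand-alone illustration rather than the model code: the program name and the array X are hypothetical, and it assumes that the GPU is the current device at startup and that the build uses separate host and device memory (“-gpu=nomanaged”), as in the original post.

```fortran
! Sketch (untested) of the update ordering around a host-side loop.
! Assumed build line, based on the command quoted in this thread:
!   nvfortran -stdpar -acc -target=gpu,multicore -gpu=nomanaged update_order_demo.f90
program update_order_demo
  use openacc
  implicit none
  integer, parameter :: n = 8   ! hypothetical array size
  real :: x(n)
  integer :: i

  x = 0.0
  !$acc enter data copyin(x)    ! create the device copy of x

  !$acc update host(x)          ! GPU to CPU, issued while the GPU is current
  call acc_set_device_type(acc_device_host)

  do concurrent (i = 1:n)       ! runs on the multicore host
    x(i) = real(i)
  end do

  ! An "update device(x)" placed here would do nothing, because the
  ! current device is still the host.
  call acc_set_device_type(acc_device_nvidia)
  !$acc update device(x)        ! CPU to GPU, now that the GPU is current again

  !$acc parallel loop present(x)
  do i = 1, n
    x(i) = 2.0 * x(i)           ! GPU kernel working on the refreshed device copy
  end do

  !$acc update host(x)
  print *, x
  !$acc exit data delete(x)
end program update_order_demo
```

In the solver above, the same idea means switching back to “acc_device_nvidia” first and only then issuing the UPDATE DEVICE for XPOI, on every path that returns from the routine.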

