OpenACC routine call inside OpenMP parallel loop

danxpy · May 21, 2021, 9:15pm

Hi,
I get an error when trying to compile a program in which an OpenACC kernel that calls a routine
is enclosed in OpenMP directives.

program test

   IMPLICIT NONE
   INTEGER :: k, t
   REAL :: x

   x = 0 
   
!$omp parallel do private(t)
   DO t = 1, 20
!$acc kernels loop private(x,k)
        DO k = 1, 2
           !x = x + 1
           CALL hello(x)
        ENDDO
!$acc end kernels
   ENDDO
!$omp end parallel do

   PRINT*, x

CONTAINS

   SUBROUTINE hello(x)
!$acc routine seq
      REAL, INTENT(INOUT) :: x
      x = x + 1 
   END SUBROUTINE

end program test

The error I get when compiling:

$ pgf95 -O3 -mp -acc -Minfo=accel -ta=tesla:cc60,cuda10.1 ompacc.f90 -o kernel_v2
test:
     12, Loop is parallelizable
         Generating Tesla code
         12, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
hello:
     24, Generating acc routine seq
         Generating Tesla code
nvvmCompileProgram error: 9.
Error: /tmp/pgaccrOdgZRIJzdyN.gpu (63, 14): parse invalid forward reference to function '_hello_' with wrong type!
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (ompacc.f90: 27)
  0 inform,   0 warnings,   1 severes, 0 fatal for

The code compiles if I remove either the OpenMP directives or the routine call.

MatColgrove · May 21, 2021, 10:17pm

Two problems here. First, “x” needs to be private to the OpenMP loop. Second, using contained device subroutines is problematic.

Contained subroutines get passed a hidden pointer to the parent’s stack. Since this is a host stack pointer, accessing it on the device will cause issues. It sometimes “works” when the contained routine doesn’t access the parent’s variables, but within an OpenMP region each thread has a different stack, and this is what’s causing the above error. In general, not just within OpenMP, it’s recommended not to use contained subroutines in device code.

Example:

    % cat test.f90
    program test

       IMPLICIT NONE
       INTEGER :: k, t
       REAL :: x
    !$acc routine(hello) seq
       x = 0

    !$omp parallel do private(x)
       DO t = 1, 20
    !$acc kernels loop private(x)
            DO k = 1, 2
               !x = x + 1
               CALL hello(x)
            ENDDO
    !$acc end kernels
       ENDDO
    !$omp end parallel do

       PRINT*, x

    end program test

       SUBROUTINE hello(x)
    !$acc routine seq
          REAL, INTENT(INOUT) :: x
          x = x + 1
       END SUBROUTINE

    % nvfortran -acc -mp test.f90 -Minfo=accel ; a.out
    test:
         12, Loop is parallelizable
             Generating Tesla code
             12, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
    hello:
         24, Generating acc routine seq
             Generating Tesla code
        0.000000

Hope this helps,
Mat

danxpy · May 21, 2021, 10:39pm

Hi Mat,

Thanks a lot!

Can I use a module to contain device subroutines instead ?
The following code compiles without issues, but it may be the case you mentioned
in which it only sometimes happens to work…

MODULE kernels

CONTAINS
   SUBROUTINE hello(x)
!$acc routine seq
      REAL, INTENT(INOUT) :: x
      x = x + 1 
   END SUBROUTINE

END MODULE kernels

program test
   
!$acc routine(hello) seq
   USE kernels

   IMPLICIT NONE
   INTEGER :: k, t
   REAL :: x

   x = 0 
   
!$omp parallel do private(x,t)
   DO t = 1, 20
!$acc kernels loop private(x,k)
        DO k = 1, 2
           CALL hello(x)
        ENDDO
!$acc end kernels
   ENDDO
!$omp end parallel do

   PRINT*, x

end program test

regards,
Daniel

MatColgrove · May 21, 2021, 11:01pm

Hi Daniel,

The original issue is with a subroutine contained within another program unit, which would use the host’s stack to pass the parent’s variables (via a hidden pointer).

Using a module subroutine is actually preferred, since the module automatically gives callers an explicit interface to the routine. An explicit interface isn’t needed for this case, but would be if you were passing in an assumed-shape array. Module variables are stored in static (global) memory, which can be accessed on the device if you add the variables to a “declare” directive to create the device copy.
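
As an illustration of the “declare” directive mentioned above, here is a minimal sketch. The module name `kernels_data`, the variable `scale`, and the program `test_declare` are hypothetical and do not appear elsewhere in this thread; the sketch only assumes a compiler with OpenACC support (for example `nvfortran -acc`).

```fortran
MODULE kernels_data

   IMPLICIT NONE
   ! Hypothetical module variable (not from the original posts).
   REAL :: scale
   ! Create a device copy of the module variable in device global memory.
!$acc declare create(scale)

CONTAINS

   SUBROUTINE hello(x)
!$acc routine seq
      REAL, INTENT(INOUT) :: x
      ! On the device this reads the device copy of scale.
      x = x + scale
   END SUBROUTINE

END MODULE kernels_data

program test_declare

   USE kernels_data
   IMPLICIT NONE
   INTEGER :: k
   REAL :: y(4)

   scale = 1.0
   ! The device copy is not updated automatically; push the host value over.
!$acc update device(scale)

   y = 0
!$acc parallel loop copy(y)
   DO k = 1, 4
      CALL hello(y(k))
   ENDDO

   PRINT*, y

end program test_declare
```

Because `declare create` only allocates the device copy, the `update device(scale)` after the host assignment is what makes the host value visible to the device routine.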

-Mat

danxpy · May 22, 2021, 12:05am

Hi again,

This is a small part of a bigger code I am trying to debug by extracting small examples.
I ran into a problem after replacing x in the previous code with an allocatable array.
It compiles fine but crashes at run time. The allocatable array appears to be the cause, because without it the program does not crash.

MODULE kernels

   IMPLICIT NONE

CONTAINS

   SUBROUTINE hello(arr)
!$acc routine worker
      REAL, INTENT(INOUT), DIMENSION(16,16) :: arr 
      INTEGER :: I
      DO I = 1, 4
        arr(I,I) = arr(I,I) + 1 
      END DO
   END SUBROUTINE

END MODULE kernels

program test
   
   USE kernels
   IMPLICIT NONE
   INTEGER :: k, t
   REAL, ALLOCATABLE, DIMENSION( :,:,:,: ) :: arr 

   ALLOCATE(arr(16,16,20,20))
   arr = 0 
!$acc enter data copyin(arr)
   PRINT*, arr(:,1,1,1)
   
!$omp parallel do
   DO t = 1, 20
!$acc kernels
        DO k = 1, 20
           CALL hello(arr(:,:,k,t))
        ENDDO
!$acc end kernels
   ENDDO
!$omp end parallel do

!$acc update self(arr)
   PRINT*, arr(:,1,1,1)

end program test

Compiling and running as:

$ pgf95 -g -mp -acc -Minfo=accel -ta=tesla:cc60,cuda10.1 ompacc.f90 
$ OMP_NUM_THREADS=1 ./a.out
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Failing in Thread:1
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

I ran it through cuda-gdb too:

 CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0xef45c8 (ompacc.f90:13)

Thread 18 "kernel_v2" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x0000000000ef45d0 in kernels_hello_ () at ompacc.f90:13
13	        arr(I,I) = arr(I,I) + 1

danxpy · May 22, 2021, 3:11am

It looks like the problem was that I needed to explicitly add a present(arr) clause to the kernels construct for some reason.
I also tried default(present) on the kernels construct, but that did not fix the crash.
I guess somehow OpenACC is losing track of the “enter data copyin(arr)” statement that made the array available on the device all the time.

MatColgrove · May 24, 2021, 4:37pm

I guess somehow OpenACC is losing track of the “enter data copyin(arr)” statement that made the array available on the device all the time.

It’s because “arr” is only used as an argument to the call, so the compiler doesn’t include it in the list of variables it needs to check for a “default(present)” clause. If you used the array in the loop body itself, then it would have worked as expected. Something like:

!$omp parallel do private(k)
   DO t = 1, 20
!$acc kernels default(present)
        DO k = 1, 20
           arr(:,:,k,t) = 2
           CALL hello(arr(:,:,k,t))
        ENDDO
!$acc end kernels
   ENDDO
!$omp end parallel do
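
For comparison, the explicit present(arr) fix that Daniel describes above can be sketched as follows. This is assembled from the code posted earlier in the thread with only the present(arr) clause, private(k), and a closing exit data directive added; it is a sketch under those assumptions rather than a verified program.

```fortran
MODULE kernels

   IMPLICIT NONE

CONTAINS

   SUBROUTINE hello(arr)
!$acc routine worker
      REAL, INTENT(INOUT), DIMENSION(16,16) :: arr
      INTEGER :: I
      DO I = 1, 4
        arr(I,I) = arr(I,I) + 1
      END DO
   END SUBROUTINE

END MODULE kernels

program test

   USE kernels
   IMPLICIT NONE
   INTEGER :: k, t
   REAL, ALLOCATABLE, DIMENSION( :,:,:,: ) :: arr

   ALLOCATE(arr(16,16,20,20))
   arr = 0
   ! Create the device copy once, outside the OpenMP loop.
!$acc enter data copyin(arr)

!$omp parallel do private(k)
   DO t = 1, 20
      ! arr appears only as a call argument, so name it explicitly:
      ! present(arr) makes the region use the existing device copy.
!$acc kernels present(arr)
      DO k = 1, 20
         CALL hello(arr(:,:,k,t))
      ENDDO
!$acc end kernels
   ENDDO
!$omp end parallel do

   ! Copy the results back to the host before printing.
!$acc update self(arr)
   PRINT*, arr(:,1,1,1)
!$acc exit data delete(arr)

end program test
```

Unlike the default(present) variant, this leaves the contents of arr untouched before the call, so the routine's result is what gets copied back.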
