OpenACC routine call inside OpenMP parallel loop

Hi,
I get an error when trying to compile a program in which an OpenACC kernel that calls a routine
is enclosed by OpenMP directives.

program test

   IMPLICIT NONE
   INTEGER :: k, t
   REAL :: x

   x = 0 
   
!$omp parallel do private(t)
   DO t = 1, 20
!$acc kernels loop private(x,k)
        DO k = 1, 2
           !x = x + 1
           CALL hello(x)
        ENDDO
!$acc end kernels
   ENDDO
!$omp end parallel do

   PRINT*, x

CONTAINS

   SUBROUTINE hello(x)
!$acc routine seq
      REAL, INTENT(INOUT) :: x
      x = x + 1 
   END SUBROUTINE

end program test

The error I get when compiling:

$ pgf95 -O3 -mp -acc -Minfo=accel -ta=tesla:cc60,cuda10.1 ompacc.f90 -o kernel_v2
test:
     12, Loop is parallelizable
         Generating Tesla code
         12, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
hello:
     24, Generating acc routine seq
         Generating Tesla code
nvvmCompileProgram error: 9.
Error: /tmp/pgaccrOdgZRIJzdyN.gpu (63, 14): parse invalid forward reference to function '_hello_' with wrong type!
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (ompacc.f90: 27)
  0 inform,   0 warnings,   1 severes, 0 fatal for

The code compiles if I remove either the OpenMP directives or the routine call.

Two problems here. First, “x” needs to be private to the OpenMP loop. Second, using contained device subroutines is problematic.

Contained subroutines get passed a hidden pointer to the parent’s stack. Since this is a host stack pointer, accessing it on the device will cause issues. It sometimes “works” when the contained routine doesn’t access the parent’s variables, but within an OpenMP region each thread has a different stack, and this is what’s causing the above error. In general, not just within OpenMP, it’s recommended not to use contained subroutines in device code.

Example:

    % cat test.f90
    program test

       IMPLICIT NONE
       INTEGER :: k, t
       REAL :: x
    !$acc routine(hello) seq
       x = 0

    !$omp parallel do private(x)
       DO t = 1, 20
    !$acc kernels loop private(x)
            DO k = 1, 2
               !x = x + 1
               CALL hello(x)
            ENDDO
    !$acc end kernels
       ENDDO
    !$omp end parallel do

       PRINT*, x

    end program test

       SUBROUTINE hello(x)
    !$acc routine seq
          REAL, INTENT(INOUT) :: x
          x = x + 1
       END SUBROUTINE

    % nvfortran -acc -mp test.f90 -Minfo=accel ; a.out
    test:
         12, Loop is parallelizable
             Generating Tesla code
             12, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
    hello:
         24, Generating acc routine seq
             Generating Tesla code
        0.000000

Hope this helps,
Mat

Hi Mat,

Thanks a lot!

Can I use a module to contain device subroutines instead?
The following code seems to compile without issues, but it may be that this is the case you mentioned
in which it sometimes happens to work…

MODULE kernels

CONTAINS
   SUBROUTINE hello(x)
!$acc routine seq
      REAL, INTENT(INOUT) :: x
      x = x + 1 
   END SUBROUTINE

END MODULE kernels

program test
   
!$acc routine(hello) seq
   USE kernels

   IMPLICIT NONE
   INTEGER :: k, t
   REAL :: x

   x = 0 
   
!$omp parallel do private(x,t)
   DO t = 1, 20
!$acc kernels loop private(x,k)
        DO k = 1, 2
           CALL hello(x)
        ENDDO
!$acc end kernels
   ENDDO
!$omp end parallel do

   PRINT*, x

end program test

regards,
Daniel

Hi Daniel,

The original issue is with a contained subroutine within another subroutine, which would use the host’s stack to pass the parent’s variables (via a hidden pointer).

Using a module subroutine is actually preferred, since the module automatically gives you an explicit interface to the routine. An interface isn’t needed for this case, but would be if you were passing in an assumed-shape array. Module variables are stored in static (global) memory, which can be accessed on the device if you add the variables to a “declare” directive in order to create the device copy.
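
As a rough sketch of the “declare” approach: the module name `params`, the variable `scale`, and the routine `apply_scale` below are made up for illustration and do not come from the code in this thread. The module variable gets a device copy via `declare create`, and the host value is pushed to the device with `update device` before the compute region uses it.

```fortran
MODULE params
   IMPLICIT NONE
   REAL :: scale                 ! hypothetical module variable
!$acc declare create(scale)      ! create a device copy of the module variable
CONTAINS
   SUBROUTINE apply_scale(x)
!$acc routine seq
      REAL, INTENT(INOUT) :: x
      x = x * scale              ! reads the device copy when run on the GPU
   END SUBROUTINE
END MODULE params

PROGRAM demo
   USE params
   IMPLICIT NONE
   INTEGER :: i
   REAL :: a(8)

   a = 1.0
   scale = 3.0
!$acc update device(scale)       ! copy the host value into the device copy

!$acc parallel loop copy(a)
   DO i = 1, 8
      CALL apply_scale(a(i))
   END DO

   PRINT *, a
END PROGRAM demo
```

Note that the device copy is separate from the host copy, so any later change to `scale` on the host would need another `update device` before the next kernel that reads it.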

-Mat

Hi again,

This is a small part of a bigger code that I am trying to debug by extracting small examples.
I ran into a problem after replacing x in the previous code with an allocatable array.
It compiles fine but crashes at run time. The allocatable array is the cause of the problem, because without it the program does not crash.

MODULE kernels

   IMPLICIT NONE

CONTAINS

   SUBROUTINE hello(arr)
!$acc routine worker
      REAL, INTENT(INOUT), DIMENSION(16,16) :: arr 
      INTEGER :: I
      DO I = 1, 4
        arr(I,I) = arr(I,I) + 1 
      END DO
   END SUBROUTINE

END MODULE kernels

program test
   
   USE kernels
   IMPLICIT NONE
   INTEGER :: k, t
   REAL, ALLOCATABLE, DIMENSION( :,:,:,: ) :: arr 

   ALLOCATE(arr(16,16,20,20))
   arr = 0 
!$acc enter data copyin(arr)
   PRINT*, arr(:,1,1,1)
   
!$omp parallel do
   DO t = 1, 20
!$acc kernels
        DO k = 1, 20
           CALL hello(arr(:,:,k,t))
        ENDDO
!$acc end kernels
   ENDDO
!$omp end parallel do

!$acc update self(arr)
   PRINT*, arr(:,1,1,1)

end program test

Compiling and running as:

$ pgf95 -g -mp -acc -Minfo=accel -ta=tesla:cc60,cuda10.1 ompacc.f90 
$ OMP_NUM_THREADS=1 ./a.out
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Failing in Thread:1
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

I ran it through cuda-gdb too:

 CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0xef45c8 (ompacc.f90:13)

Thread 18 "kernel_v2" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x0000000000ef45d0 in kernels_hello_ () at ompacc.f90:13
13	        arr(I,I) = arr(I,I) + 1

It looks like the problem was that I need to explicitly add a present(arr) clause on the kernels construct for some reason.
I also tried default(present) on the kernels construct, but that did not fix the crash.
I guess OpenACC is somehow losing track of the “enter data copyin(arr)” statement that made the array available on the device all the time.

I guess OpenACC is somehow losing track of the “enter data copyin(arr)” statement that made the array available on the device all the time

It’s because “arr” is only used as an argument to the call, so the compiler doesn’t include it in the list of variables it needs to check for a “default(present)” clause. If you used the array in the loop body itself, then it would have worked as expected. Something like:

!$omp parallel do private(k)
   DO t = 1, 20
!$acc kernels default(present)
        DO k = 1, 20
           arr(:,:,k,t) = 2
           CALL hello(arr(:,:,k,t))
        ENDDO
!$acc end kernels
   ENDDO
!$omp end parallel do
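
For comparison, here is a sketch of the other workaround reported above, where “arr” is named in an explicit present clause instead of relying on default(present). It reuses the module and program from the earlier post; the `private(k)` clause, the single-element PRINT, and the final `exit data` / DEALLOCATE cleanup are additions for this sketch, and it has not been checked against every compiler version.

```fortran
MODULE kernels
   IMPLICIT NONE
CONTAINS
   SUBROUTINE hello(arr)
!$acc routine worker
      REAL, INTENT(INOUT), DIMENSION(16,16) :: arr
      INTEGER :: I
      DO I = 1, 4
         arr(I,I) = arr(I,I) + 1
      END DO
   END SUBROUTINE
END MODULE kernels

program test
   USE kernels
   IMPLICIT NONE
   INTEGER :: k, t
   REAL, ALLOCATABLE, DIMENSION(:,:,:,:) :: arr

   ALLOCATE(arr(16,16,20,20))
   arr = 0
!$acc enter data copyin(arr)

!$omp parallel do private(k)
   DO t = 1, 20
      ! Name the array explicitly: it only appears as a call argument,
      ! so default(present) would not pick it up.
!$acc kernels present(arr)
      DO k = 1, 20
         CALL hello(arr(:,:,k,t))
      ENDDO
!$acc end kernels
   ENDDO
!$omp end parallel do

!$acc update self(arr)
   PRINT*, arr(1,1,1,1)

   ! Cleanup added for this sketch
!$acc exit data delete(arr)
   DEALLOCATE(arr)
end program test
```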