Runtime error with acc routine

Hi, I’m working with OpenACC’s routine feature in a more complicated piece of code and I’m hitting several runtime errors. Going back to basics, I put together this toy code and built it with the latest version of nvhpc I have available, but I’m still getting errors. What could it be?

$ cat repro.F90

program openacc_subroutine
    implicit none
    integer, parameter :: N = 256, M = 128 
    real, allocatable :: A(:,:), B(:,:), C(:,:)
    integer :: i

    allocate(A(N,M), B(N,M), C(N,M))

    do i = 1, N
        A(i,:) = 1.0 * i 
        B(i,:) = 2.0 * i 
    end do
    C = 0.0 

    !$acc data copyin(A, B) copyout(C)
    !$acc parallel loop gang
    do i = 1, N
        call row_add(A(i,:), B(i,:), C(i,:), M)
    end do
    !$acc end parallel loop
    !$acc end data

    deallocate(A, B, C)
end program openacc_subroutine

subroutine row_add(x, y, z, m)
    !$acc routine seq
    implicit none
    integer, intent(in)  :: m
    real, intent(in)  :: x(m), y(m)
    real, intent(out) :: z(m)
    integer :: j

    !$acc data present(x,y,z)
    do j = 1, m
        z(j) = x(j) + y(j)
    end do
    !$acc end data
end subroutine row_add

$ nvfortran -O2 -acc=gpu -Minfo=acc repro.F90 -o repro
openacc_subroutine:
     15, Generating copyin(a(:,:)) [if not already present]
         Generating copyout(c(:,:)) [if not already present]
         Generating copyin(b(:,:)) [if not already present]
     16, Generating NVIDIA GPU code
         17, !$acc loop gang ! blockidx%x
         18, !$acc loop vector(128) ! threadidx%x
     18, Loop is parallelizable
row_add:
     26, Generating acc routine seq
         Generating NVIDIA GPU code

$ ./repro 
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
 File: /path/to/bin/repro.F90
 Function: openacc_subroutine:1
 Line: 16

$ nvfortran --version

nvfortran 25.3-0 64-bit target on x86-64 Linux -tp sapphirerapids 

Thanks in advance!


If we build this reproducer with debug info, cuda-gdb shows this:

Thread 1 "repro" received signal SIGSEGV, Segmentation fault.
0x00007fff7fa22ab4 in __pgi_uacc_cuda_dataup1 (devptr=devptr@entry=140719502501376, pbufinfo=pbufinfo@entry=0x0, hostptr=hostptr@entry=0x7ea178b20344488, offset=offset@entry=0, 
    size=size@entry=1, stride=stride@entry=1, elementsize=4, lineno=16, name=0x42a23e <.STATICS1+110> "tmp$r(:)", flags=4831842048, async=-1, dindex=1) at ../../src/cuda_dataup1.c:115
115	../../src/cuda_dataup1.c: No such file or directory.
(cuda-gdb) bt
#0  0x00007fff7fa22ab4 in __pgi_uacc_cuda_dataup1 (devptr=devptr@entry=140719502501376, pbufinfo=pbufinfo@entry=0x0, hostptr=hostptr@entry=0x7ea178b20344488, offset=offset@entry=0, 
    size=size@entry=1, stride=stride@entry=1, elementsize=4, lineno=16, name=0x42a23e <.STATICS1+110> "tmp$r(:)", flags=4831842048, async=-1, dindex=1) at ../../src/cuda_dataup1.c:115
#1  0x00007fff7ff9ac55 in __pgi_uacc_dataup1 (devptr=140719502501376, pbufinfo=pbufinfo@entry=0x0, hostptr=0x7ea178b20344488, offset=0, size=1, stride=1, elementsize=4, lineno=16, 
    name=0x42a23e <.STATICS1+110> "tmp$r(:)", flags=4831842048, async=-1, devid=<optimized out>) at ../../src/dataup1.c:59
#2  0x00007fff7ff9b42e in __pgi_uacc_dataupx (devptr=<optimized out>, pbufinfo=pbufinfo@entry=0x0, hostptr=hostptr@entry=0x7ea178b20344488, poffset=poffset@entry=0, dims=1, 
    desc=desc@entry=0x7fffffffb7a0, elementsize=4, lineno=16, name=0x42a23e <.STATICS1+110> "tmp$r(:)", flags=4831842048, async=-1, devid=1, uselock=1) at ../../src/dataupx.c:127
#3  0x00007fff7ff99459 in __pgi_uacc_dataonb (filename=0x42a1d0 <.STATICS1> "/gpfs/scratch/bsc32/bsc032677/src/1/repro.f90", funcname=0x42a200 <.STATICS1+48> "openacc_subroutine", 
    pdevptr=<optimized out>, hostptr=0x7ea178b20344488, hostptrptr=0x0, poffset=0, dims=<optimized out>, desc=0x7fffffffb7a0, elementsize=4, hostdescptr=0x0, hostdescsize=0, lineno=16, 
    name=0x42a23e <.STATICS1+110> "tmp$r(:)", pdtype=0x42a284 <.STATICS1+180>, flags=4831842048, async=-1, devid=1) at ../../src/dataonb.c:640
#4  0x0000000000403816 in openacc_subroutine () at repro.f90:16
(cuda-gdb) fr 4
#4  0x0000000000403816 in openacc_subroutine () at repro.f90:16
16	    !$acc parallel loop gang

Hi Rommel,

The problem here is the temporary arrays. Since you’re passing in non-contiguous array slices, the compiler must first copy them into temp arrays so the data is passed in as a contiguous block. There are known issues with temp arrays (Alexey reported a related case with SUM using “dim”), so I went ahead and reported this one as TPR #37812.
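
To see why the reproducer needs them: Fortran stores arrays column-major, so a row slice like “A(i,:)” is strided rather than contiguous. Here’s a quick standalone check using the F2008 “is_contiguous” intrinsic (just an illustration, not part of your code):

program contiguity_check
    implicit none
    real :: A(256,128)
    A = 0.0
    ! Column-major layout: a row slice strides across columns (stride 256),
    ! while a column slice is unit-stride.
    print *, is_contiguous(A(1,:))   ! F -> a contiguous temp is needed when
                                     !      passed to an explicit-shape dummy
    print *, is_contiguous(A(:,1))   ! T -> can be passed by reference
end program contiguity_check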

However, even if this were working correctly, I’m concerned that the performance would be quite poor. Even on the CPU, the extra copy to the temp arrays hurts performance.

Hence, as a workaround that should also give better performance since there’s no need for temp arrays, I suggest you pass the full arrays to the subroutine. One drawback is that this means passing in “i” and “N” as well, since the code uses F77-style calling conventions.

Note that the “data” directive in the device routine is being ignored, since data movement can only be done from the host. Also, I’d make “row_add” a vector routine since its loop can be parallelized.

For example:

program openacc_subroutine
    implicit none
    integer, parameter :: N = 32, M = 16
    real, allocatable :: A(:,:), B(:,:), C(:,:)
    integer :: i
!$acc routine(row_add) vector
    allocate(A(N,M), B(N,M), C(N,M))

    do i = 1, N
        A(i,:) = 1.0 * i
        B(i,:) = 2.0 * i
    end do
    C = 0.0

    !$acc data copyin(A, B) copyout(C)
    !$acc parallel loop gang
    do i = 1, N
        call row_add(A, B, C, i, M, N)
    end do
    !$acc end parallel loop
    !$acc end data

    deallocate(A, B, C)
end program openacc_subroutine

subroutine row_add(x, y, z, i, m, n)
    !$acc routine vector
    implicit none
    integer  :: m, n, i
    real, intent(in)  :: x(n,m), y(n,m)
    real, intent(out) :: z(n,m)
    integer :: j

!$acc loop vector
    do j = 1, m
        z(i,j) = x(i,j) + y(i,j)
    end do
end subroutine row_add

Now, I do understand that modifying your larger code is more challenging, so if you can’t modify the subroutine, an alternate workaround is to manually create the temp arrays so they get privatized. This is effectively what the compiler is doing (sans the implicit privatization of the arrays). However, the generated kernel is about 3x slower.

program openacc_subroutine
    implicit none
    integer, parameter :: N = 256, M = 128
    real, allocatable :: A(:,:), B(:,:), C(:,:)
    real, allocatable :: Atmp(:), Btmp(:), Ctmp(:)
    integer :: i
!$acc routine(row_add) vector

    allocate(A(N,M), B(N,M), C(N,M))
    allocate(Atmp(M), Btmp(M), Ctmp(M))

    do i = 1, N
        A(i,:) = 1.0 * i
        B(i,:) = 2.0 * i
    end do
    C = 0.0

    !$acc data copyin(A, B) copyout(C)
    !$acc parallel loop gang private(Atmp,Btmp,Ctmp)
    do i = 1, N
        Atmp=A(i,:)
        Btmp=B(i,:)
        Ctmp=C(i,:)
        call row_add(Atmp, Btmp, Ctmp, M)
        C(i,:) = Ctmp
    end do
    !$acc end parallel loop
    !$acc end data
    deallocate(A, B, C, Atmp, Btmp, Ctmp)
end program openacc_subroutine

subroutine row_add(x, y, z, m)
    !$acc routine vector
    implicit none
    integer, intent(in)  :: m
    real, intent(in)  :: x(m), y(m)
    real, intent(out) :: z(m)
    integer :: j

    !$acc loop vector
    do j = 1, m
        z(j) = x(j) + y(j)
    end do
end subroutine row_add

-Mat

Thanks for your answer, Mat! It’s much clearer now.

I’d also like to point out that, in the workaround you suggest, if I change this line:

call row_add(A, B, C, i, M, N)

with:

call row_add(A(1:N,1:M), B(1:N,1:M), C(1:N,1:M), i, M, N)

I immediately get an error:

Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
 File: /path/to/bin/repro-workaround.F90
 Function: openacc_subroutine:1
 Line: 16

I’d say the compiler would have enough information at that point to consider them equivalent, right?

Thanks for your answer, Mat!
I think you are right that we are seeing certain issues with handling temp arrays, and it is probably similar to some aspects of the SUM(..,dim) issue. But I also see a similarity in one detail: it may appear that sometimes we don’t strictly need temporary arrays, yet we still get them. We suspect this because in our real code we have issues even when slicing results in a contiguous array, or when the slicing is redundant and in fact the whole array is passed (as Rommel showed above).

–Alexey

This program uses F77 calling conventions, which is why I passed in just the base address of the arrays. Plain pointers are more performant on the device.

Here, you’re passing in an array slice, and since the arrays are dynamically allocated, their size is not known at compile time, so the compiler still needs to set up temp arrays for the call.

I see two options: make A, B, and C fixed-size arrays, or pass them as assumed-shape arrays. Though for assumed shape, you’ll need an F90 interface, either via an interface block or a module.

Example 1:

program openacc_subroutine
    implicit none
    integer, parameter :: N = 32, M = 16
    real :: A(N,M), B(N,M), C(N,M)
    integer :: i
!$acc routine(row_add) vector

    do i = 1, N
        A(i,:) = 1.0 * i
        B(i,:) = 2.0 * i
    end do
    C = 0.0

    !$acc data copyin(A, B) copyout(C)
    !$acc parallel loop gang
    do i = 1, N
        call row_add(A(1:N,1:M), B(1:N,1:M), C(1:N,1:M), i, M, N)
    end do
    !$acc end parallel loop
    !$acc end data

end program openacc_subroutine

subroutine row_add(x, y, z, i, m, n)
    !$acc routine vector
    implicit none
    integer  :: m, n, i
    real, intent(in)  :: x(n,m), y(n,m)
    real, intent(out) :: z(n,m)
    integer :: j

!$acc loop vector
    do j = 1, m
        z(i,j) = x(i,j) + y(i,j)
    end do
end subroutine row_add

Example 2, using a module:

module foo
    integer, parameter :: N = 32, M = 16

contains

subroutine row_add(x, y, z, i)
    !$acc routine vector
    implicit none
    integer, value  ::  i
    real, intent(in)  :: x(:,:), y(:,:)
    real, intent(out) :: z(:,:)
    integer :: j

!$acc loop vector
    do j = 1, M
        z(i,j) = x(i,j) + y(i,j)
    end do
end subroutine row_add

end module foo

program openacc_subroutine
    use foo
    implicit none
    real, allocatable :: A(:,:), B(:,:), C(:,:)
    integer :: i

    allocate(A(N,M), B(N,M), C(N,M))

    do i = 1, N
        A(i,:) = 1.0 * i
        B(i,:) = 2.0 * i
    end do
    C = 0.0

    !$acc data copyin(A, B) copyout(C)
    !$acc parallel loop gang
    do i = 1, N
        call row_add(A(1:N,1:M), B(1:N,1:M), C(1:N,1:M), i)
    end do
    !$acc end parallel loop
    !$acc end data

    deallocate(A, B, C)
end program openacc_subroutine

Hi Mat,
Thanks for your workaround ideas. For now we are not inclined to apply such a serious refactoring effort to our real code.

Meanwhile, I’m thinking about your statement:


you’re passing in an array slice, and since the arrays are dynamically allocated, their size is not known at compile time, so the compiler still needs to set up temp arrays for the call.

I decided to check in practice whether modern Fortran compilers really make a copy every time for dynamically allocated arrays (an alternative would be for the code that creates the temporary to be skipped at runtime when certain conditions are met, and those conditions can be evaluated on every call rather cheaply).

test_slice.tar.gz (1.2 KB)

The CPU and GPU test programs attached here check the pointer value from inside the subroutine in the following cases:


base addr:  c_loc(arr(1))
case 1:     call check_addr(arr, slice_addr1)
case 2:     call check_addr(arr(1:N), slice_addr2)
case 3:     call check_addr(arr(1:N:2), slice_addr3)

If the address is the same as the base address, then the array is the same as the original; no temp array.
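
Roughly, the check looks like this (a minimal CPU-only sketch with illustrative names; the attached programs also cover the OpenACC variants):

program test_slice
    use iso_c_binding
    implicit none
    integer, parameter :: N = 1024
    real, allocatable, target :: arr(:)
    integer(c_intptr_t) :: base_addr, slice_addr1, slice_addr2, slice_addr3

    allocate(arr(N))
    arr = 0.0
    base_addr = transfer(c_loc(arr(1)), base_addr)

    call check_addr(arr,        slice_addr1)   ! case 1: whole array
    call check_addr(arr(1:N),   slice_addr2)   ! case 2: contiguous full slice
    call check_addr(arr(1:N:2), slice_addr3)   ! case 3: strided slice

    print *, 'Base address of arr   :', base_addr
    print *, 'Address seen in sub 1 :', slice_addr1
    print *, 'Address seen in sub 2 :', slice_addr2
    print *, 'Address seen in sub 3 :', slice_addr3
end program test_slice

subroutine check_addr(arr, seen_addr)
    use iso_c_binding
    implicit none
    real, target, intent(in)         :: arr(*)
    integer(c_intptr_t), intent(out) :: seen_addr
    ! Record the address of the first element as seen by the callee.
    ! With an assumed-size (F77-style) dummy, any copy-in made by the
    ! caller shows up as a different address than the base address.
    seen_addr = transfer(c_loc(arr(1)), seen_addr)
end subroutine check_addr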

Now, what we observe for CPU code with nvfortran:


 Base address of arr      :                 36377280
 Address seen in sub 1    :                 36377280
 Address seen in sub 2    :                 36377280
 Address seen in sub 3    :          140732524818768

so we have a temp array only in “case 3” (which is really needed and can’t be avoided).

For GPU code with Cray CCE 17 (AMD):


 Base device address of arr: 23270684360704
 Address seen in sub 1a    : 23270684360704
 Address seen in sub 2a    : 23270684360704
 Address seen in sub 1b    : 23270684360704
 Address seen in sub 2b    : 23270684360704
 Address seen in sub 3a    : 23270301071360
 Address seen in sub 3b    : 23270684360704

basically the same – we have temp arrays in “case 3” only. Sub-cases a and b are for “!$acc routine seq” and “!$acc routine vector”.

For GPU code with NVHPC:


 Base device address of arr:          139777025220608
 Address seen in sub 1a    :          139777025220608
 Address seen in sub 2a    :          139777025220608
 Address seen in sub 1b    :               8698984736
 Address seen in sub 2b    :               8698984736
 Address seen in sub 3a    :                        0
 Address seen in sub 3b    :                        0

here “case 3” crashes; also, we have temps in both “case 1” and “case 2”, but only when we deal with the “!$acc routine vector”.

Let’s set aside the “case 3” crash (unfortunate, but a bit off-topic here). I’d say the nvfortran/GPU behavior seems a bit unusual (at least compared to the nvfortran/CPU case and the Cray/GPU case); also, the statement that “the compiler still needs to set up temp arrays for the call” in such scenarios seems doubtful.

Back during the development of the original PGI High-Performance Fortran (HPF) compiler, there was a design choice to have the compiler always pass arrays as contiguous. This meant more use of temp arrays, but it has the benefit that the compiler can always assume contiguous memory when applying parallelization in HPF. Other compilers made different choices, but for us, changing this would basically mean rewriting the entire compiler (more on this below).

Now, often on the CPU the compiler can optimize away the temp arrays, but it’s more challenging with F77 calling conventions and GPU offload.

A few years ago, since the PGI infrastructure had become too limited in its ability to support newer Fortran standards like F2018, we partnered with the LLVM community to create a new, modern compiler that will eventually replace the current PGI-based nvfortran. It’s currently under development and not too far from a beta release. However, the initial release won’t have OpenACC, which is why I didn’t mention it; that will come a bit later.

Also, we’re in the process of upstreaming OpenACC into the mainline Flang compiler. I’ve tested your sample code with the development version and it does not have the issues you’ve encountered. Though I hesitate to mention it because I don’t have a timeline for when this will land in an LLVM release.

I can’t advise you on what to do. If you’re unable to refactor your code, then you might try Cray’s OpenACC. I can’t recommend gfortran, if only because it doesn’t have good support for “kernels” and ends up making those regions serial. Its support for “parallel” is fine, but you’d need to refactor array syntax into explicit loops. You could wait for the new compilers, but I don’t have timelines for availability, and inevitably there’s a break-in period, so you might encounter a whole new set of issues.
