Allocatable arrays made private for OpenMP lead to out-of-bounds (0x0) writes with nvfortran

I’m trying to add OpenMP offload to a code with many scratch arrays, but making them private doesn’t work properly.

Here is a reproduction of what I mean:

PROGRAM scratchTest

USE OMP_LIB
IMPLICIT NONE

INTEGER, PARAMETER  :: NN=4096
INTEGER, PARAMETER  :: NM=4096

INTEGER, PARAMETER  :: SCRATCHSIZE=10

REAL,ALLOCATABLE,DIMENSION(:,:) :: scratch
REAL,ALLOCATABLE,DIMENSION(:,:)   :: A

REAL :: st, et
INTEGER::I,J,K

ALLOCATE(A(NN,NM))

ALLOCATE(scratch(SCRATCHSIZE,SCRATCHSIZE))

A = 0.0D0
scratch = 0.0D0

st = omp_get_wtime()

!$OMP target enter data map(to:A)
    !$OMP target teams distribute parallel do private(scratch) collapse(2)
    DO j=1,NN
        DO i=1,NM
            DO k=1,SCRATCHSIZE
                    scratch(k,k)=j+i
            END DO
            A(i,j)=SUM(scratch)
        END DO
    END DO

    !$OMP END target teams distribute parallel do

!$OMP target exit data map(from:A)
et = omp_get_wtime()

PRINT *, " total time: ", (et - st), " s"
PRINT *,A(10,10)

END PROGRAM scratchTest

(This code doesn’t do anything useful. It just illustrates the problem.)

If I declare scratch with a fixed size at compile time it works fine, but with an allocatable it just crashes like this:

========= Invalid global write of size 8 bytes
========= at nvkernel_MAIN__F1L27_2_+0xaf0 in /home/ec2-user/private_problem/scratch_array.f90:31
========= by thread (56,0,0) in block (7,0,0)
========= Address 0x0 is out of bounds
========= and is 8,696,889,344 bytes before the nearest allocation at 0x206600000 of size 8,388,864 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x33137f]
========= in /lib64/libcuda.so.1
========= Host Frame:launchInternal in platform_cuda/hxCuda.c:3402 [0x4ac4c]
========= in /opt/nvidia/hpc_sdk/Linux_x86_64/24.5/compilers/lib/libnvomp.so
…(etc)

Is making an allocatable array private in nvfortran with OpenMP offload not possible, or am I missing something?

Yes, it’s possible, though with “distribute parallel do” private allocatable arrays need to be allocated on the device. Hence you’re likely encountering a heap overflow, given that the default device heap is small.

You have a few options:

  1. Set the environment variable NV_ACC_CUDA_HEAPSIZE to a large value (I used 1GB)
  2. Explicitly call the CUDA Fortran routine “cudaDeviceSetLimit” with “cudaLimitMallocHeapSize” and a large value (see the sketch after this list). Note that NV_ACC_CUDA_HEAPSIZE essentially makes this call for you, but applies it to all kernel launches; calling it yourself lets you change the limit between kernels.
  3. Switch to using the “loop” construct instead of “distribute parallel do”. With “loop”, the compiler doesn’t need to outline the region and can therefore allocate the private arrays before launching the kernels rather than inside the kernel.
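
For example, a minimal sketch of option 2 (my illustration, not the original code; the 1 GB value is arbitrary): the limit has to be set before the first kernel launch, and the call requires CUDA Fortran to be enabled (e.g. compile with -cuda). Option 1 achieves the same effect for every launch by setting NV_ACC_CUDA_HEAPSIZE in the environment before running.

PROGRAM setHeapExample

USE cudafor
USE OMP_LIB
IMPLICIT NONE

INTEGER :: istat

! Reserve 1 GB of device heap for in-kernel allocations (e.g. private allocatables).
istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024_8*1024_8*1024_8)
IF (istat /= cudaSuccess) PRINT *, "cudaDeviceSetLimit failed: ", istat

! ... target regions that use private allocatable arrays go here ...

END PROGRAM setHeapExample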

Thank you for the prompt reply.

I can confirm that those options work for my reproduction example.

However, in the real code I don’t think I can set the heap size high enough to make it work, so perhaps I simply have too much to allocate and not enough memory.

Changing “distribute parallel do” to the “loop” construct works perfectly in my simplified reproduction example, but in the real code switching to the “loop” construct makes the compiler crash with a segfault.

That’s possible. If so, you may need to not collapse the loops, or investigate a blocking algorithm where only a portion of NN and NM is computed at a time.
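
To illustrate the blocking suggestion (a sketch of mine, not the original code; JBLK is a made-up block size), the j loop of the first reproducer could be split so each kernel launch covers only a slice of NN, which should reduce how many private scratch allocations live on the device heap at once:

INTEGER, PARAMETER :: JBLK = 256   ! illustrative block size
INTEGER :: jb, jend

DO jb = 1, NN, JBLK
    jend = MIN(jb + JBLK - 1, NN)
    !$OMP target teams distribute parallel do private(scratch) collapse(2)
    DO j = jb, jend
        DO i = 1, NM
            DO k = 1, SCRATCHSIZE
                scratch(k,k) = j + i
            END DO
            A(i,j) = SUM(scratch)
        END DO
    END DO
    !$OMP end target teams distribute parallel do
END DO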

but in the real code switching to the “loop” construct makes the compiler crash with a segfault.

Are you able to provide a reproducing example? If so, I can report the issue to engineering so we can get it fixed.

Another possibility is to switch to using OpenACC’s “parallel” construct, which is similar to OpenMP’s “loop” construct.
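
For reference, a sketch of what the first reproducer’s loop nest might look like with OpenACC (my sketch; compile with -acc=gpu instead of -mp=gpu):

!$acc enter data copyin(A)
!$acc parallel loop collapse(2) private(scratch)
DO j = 1, NN
    DO i = 1, NM
        DO k = 1, SCRATCHSIZE
            scratch(k,k) = j + i
        END DO
        A(i,j) = SUM(scratch)
    END DO
END DO
!$acc exit data copyout(A)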

I’m still working on finding a simple repro of the compiler segfault.

However, in the process I’ve come across another problem which is going to cause me headaches.

It seems that global variables can’t be made private successfully.

MODULE SUMTEST

IMPLICIT NONE
!$OMP declare target(scratch,A)
REAL,ALLOCATABLE,DIMENSION(:,:) :: scratch
REAL,ALLOCATABLE,DIMENSION(:,:)   :: A

END MODULE SUMTEST

PROGRAM scratchTest

    USE OMP_LIB
    USE SUMTEST
    IMPLICIT NONE

    INTEGER, PARAMETER  :: NN=512
    INTEGER, PARAMETER  :: NM=512

    INTEGER, PARAMETER  :: SCRATCHSIZE=10


    REAL :: st, et
    INTEGER::I,J,K

    ALLOCATE(A(NN,NM))

    ALLOCATE(scratch(SCRATCHSIZE,SCRATCHSIZE))

    A = 0.0D0
    scratch = 0.0D0

    st = omp_get_wtime()

    !$OMP target enter data map(to:A,scratch)
    !$OMP target teams loop private(scratch) collapse(2)
    DO j=1,NN
        DO i=1,NM
                scratch(1,1)=j+i
                A(i,j)=scratch(1,1)
        END DO
    END DO

    !$OMP END target teams loop

    !$OMP target exit data map(from:A)
    et = omp_get_wtime()

    PRINT *, " total time: ", (et - st), " s"
    PRINT *,A(10,10)
END PROGRAM scratchTest

This code prints random numbers at the end, which seems to be due to all threads writing to a single shared scratch array.

The original code prints the correct value of 200 at the end.

Also, with the loop directive, I don’t seem to be able to pass a private array to a subroutine:

MODULE SUMTEST

IMPLICIT NONE
CONTAINS

SUBROUTINE SUM_SCRATCH(i,j,Ap,scratchp)
        IMPLICIT NONE
       !$OMP declare target
        INTEGER, INTENT(IN) :: i,j
        REAL,DIMENSION(:,:), INTENT(IN) :: scratchp
        REAL,DIMENSION(:,:), INTENT(INOUT)   :: Ap
        Ap(i,j)=SUM(scratchp)

END SUBROUTINE SUM_SCRATCH

END MODULE SUMTEST

PROGRAM scratchTest

    USE OMP_LIB
    USE SUMTEST
    IMPLICIT NONE

    INTEGER, PARAMETER  :: NN=4096
    INTEGER, PARAMETER  :: NM=4096

    INTEGER, PARAMETER  :: SCRATCHSIZE=10

    REAL,ALLOCATABLE,DIMENSION(:,:) :: scratch
    REAL,ALLOCATABLE,DIMENSION(:,:)   :: A

    REAL :: st, et
    INTEGER::I,J,K



    ALLOCATE(A(NN,NM))

    ALLOCATE(scratch(SCRATCHSIZE,SCRATCHSIZE))

    A = 0.0D0
    scratch = 0.0D0

    st = omp_get_wtime()

    !$OMP target enter data map(to:A, scratch)
    !$OMP target loop private(scratch) collapse(2)
        DO j=1,NN
            DO i=1,NM
                DO k=1,SCRATCHSIZE
                        scratch(k,k)=j+i
                END DO
                call SUM_SCRATCH(i,j,A,scratch)
            END DO
        END DO

        !$OMP END target loop

    !$OMP target exit data map(from:A)
    et = omp_get_wtime()

    PRINT *, " total time: ", (et - st), " s"
    PRINT *,A(10,10)
END PROGRAM scratchTest

NVFORTRAN-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): No device symbol for address reference (scratch_array2.f90: 47)

If I remove private(scratch) it compiles but of course the results are incorrect.

Correct, a variable can’t be both global and private since the compiler has no way of knowing which one you want. In this case, it’s using the global copy. To fix your code, remove “scratch” from the “declare target” and “enter data” directives.
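
For clarity, the changed parts of the earlier module example would look like this (a sketch of the suggested fix; the rest of the program stays the same):

MODULE SUMTEST

IMPLICIT NONE
! scratch removed from declare target; only A remains a device module variable.
!$OMP declare target(A)
REAL,ALLOCATABLE,DIMENSION(:,:) :: scratch
REAL,ALLOCATABLE,DIMENSION(:,:) :: A

END MODULE SUMTEST

and in the main program:

    ! scratch removed from the map; it is handled by private() alone.
    !$OMP target enter data map(to:A)
    !$OMP target teams loop private(scratch) collapse(2)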

Also, with the loop directive, I don’t seem to be able to pass a private array to a subroutine:

You can, though we have a known limitation on passing private assumed-shape arrays as arguments. The workaround is to use:

call SUM_SCRATCH(i,j,A,scratch(:,:))

The issue is that in order to pass an assumed-shape array, a descriptor must be passed, but a private array doesn’t have a descriptor (it’s one block of memory partitioned between the threads). Using the slicing expression creates a descriptor at runtime as part of the call, so it can then be passed.

It works with “distribute parallel do” since the private array is allocated separately by each thread on the device. So that’s another workaround, but you’ll then need to contend with the original heap overflow problem.

Right. Yes, both of those fixes work for me.

In the actual code this may require adding slicing expressions to a number of subroutine calls.

I also now have a very simple repro of the crash during compiling:

MODULE GLOBALS
   IMPLICIT NONE
   REAL :: Z
   REAL, ALLOCATABLE, DIMENSION(:)::A

END MODULE GLOBALS

MODULE RECON
   USE GLOBALS

   IMPLICIT NONE

CONTAINS

   SUBROUTINE CRASH_REPRO()

      IMPLICIT NONE
      INTEGER::II

      ALLOCATE (A(10))
      Z = 7

      !$OMP target teams loop
      DO II = 1, 10000;
         A = Z;
      END DO
      !$OMP end target teams loop

      PRINT *, A(1)

   END SUBROUTINE CRASH_REPRO

END MODULE RECON

PROGRAM compilerCrashTest

   USE RECON
   IMPLICIT NONE

   CALL CRASH_REPRO()

END PROGRAM compilerCrashTest

The compiler crash is resolved by adding
!$OMP declare target(A)
(and with enter/exit data transfers of A, the output is as expected).

But just having the compiler segfault isn’t very helpful in finding the offending variable.

Thanks! I created a problem report, TPR #35799, and sent it to engineering to investigate.

Another workaround is to use slicing:

      DO II = 1, 10000;
         A(:) = Z;
      END DO

Thanks @MatColgrove

Now I’ll go try to apply all these bits of information to the real code and see if I can get it working.

Passing an array to a subroutine without a slice expression within the loop construct also causes the compiler to segfault:

MODULE GLOBALS
   IMPLICIT NONE
   REAL, ALLOCATABLE, DIMENSION(:)::A

   CONTAINS
SUBROUTINE CRASH_TEST(Ap)
    IMPLICIT NONE
   !$OMP declare target
    REAL,DIMENSION(:), INTENT(INOUT)   :: Ap
    Ap(2)=10

END SUBROUTINE CRASH_TEST

END MODULE GLOBALS

PROGRAM compilerCrashTest

   USE GLOBALS
   IMPLICIT NONE

   INTEGER::II

   ALLOCATE (A(10))

   !$OMP target teams loop
   DO II = 1, 10000;
      CALL CRASH_TEST(A)
   END DO
   !$OMP end target teams loop

   PRINT *, A(1)

END PROGRAM compilerCrashTest

Probably the same root cause as the assignment operator crash.

Given the previous discussion, I suppose there isn’t a way (with OpenMP directives) to deal with the situation of a global made private for the target loop but also referenced in a subroutine called within that loop.
E.g.

MODULE SUMTEST

IMPLICIT NONE
!$OMP declare target(A)
REAL,ALLOCATABLE,DIMENSION(:,:) :: scratch
REAL,ALLOCATABLE,DIMENSION(:,:)   :: A


CONTAINS
SUBROUTINE SUM_SCRATCH(i,j,Ap)
    IMPLICIT NONE
   !$OMP declare target
    INTEGER, INTENT(IN) :: i,j
    REAL,DIMENSION(:,:), INTENT(INOUT)   :: Ap
    Ap(i,j)=SUM(scratch)

END SUBROUTINE SUM_SCRATCH

END MODULE SUMTEST

PROGRAM scratchTest

    USE OMP_LIB
    USE SUMTEST
    IMPLICIT NONE

    INTEGER, PARAMETER  :: NN=512
    INTEGER, PARAMETER  :: NM=512

    INTEGER, PARAMETER  :: SCRATCHSIZE=10


    REAL :: st, et
    INTEGER::I,J,K

    ALLOCATE(A(NN,NM))

    ALLOCATE(scratch(SCRATCHSIZE,SCRATCHSIZE))

    A = 0.0D0
    scratch = 0.0D0

    st = omp_get_wtime()

    !$OMP target enter data map(to:A)
    !$OMP target teams loop private(scratch) collapse(2)
    DO j=1,NN
        DO i=1,NM
            DO k=1,SCRATCHSIZE
                scratch(k,k)=j+i
            END DO
            call SUM_SCRATCH(i,j,A)
        END DO
    END DO

    !$OMP END target teams loop

    !$OMP target exit data map(from:A)
    et = omp_get_wtime()

    PRINT *, " total time: ", (et - st), " s"
    PRINT *,A(10,10)
END PROGRAM scratchTest

It seems like the only way would be to change how the code actually allocates and/or passes these variables.

When targeting the CPU, I believe you’d use a “threadprivate” directive on the module variable. However, we were not able to implement the same functionality for device code.
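
For reference, the CPU-side pattern being referred to would look roughly like this (a sketch, not taken from the actual code):

MODULE SUMTEST

IMPLICIT NONE
REAL,ALLOCATABLE,DIMENSION(:,:) :: scratch
! Each host thread gets, and must allocate, its own copy of scratch:
!$OMP threadprivate(scratch)

END MODULE SUMTEST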

Besides passing in the private array, another possibility is to manually privatize the array by adding two extra dimensions, sized NN and NM (see the sketch below). The array itself would be global, but each thread would have a unique scratch space.

If you don’t mind adding a bit of CUDA Fortran, you can add the “device” attribute to the array so that it’s only created on the device.
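
A rough sketch of that manual-privatization idea, reusing the 512x512 sizes from the earlier example (my illustration; the commented declaration shows the CUDA Fortran “device” variant, which needs -cuda and no map clauses):

PROGRAM manualPrivate

    USE OMP_LIB
    IMPLICIT NONE

    INTEGER, PARAMETER :: NN=512, NM=512, SCRATCHSIZE=10

    ! Two trailing dimensions give every (i,j) iteration its own slab of scratch.
    REAL,ALLOCATABLE,DIMENSION(:,:,:,:) :: scratch
    ! CUDA Fortran alternative (device-only storage, no mapping required):
    ! REAL,ALLOCATABLE,DEVICE,DIMENSION(:,:,:,:) :: scratch
    REAL,ALLOCATABLE,DIMENSION(:,:) :: A
    INTEGER :: I,J,K

    ALLOCATE(A(NN,NM))
    ALLOCATE(scratch(SCRATCHSIZE,SCRATCHSIZE,NM,NN))
    A = 0.0

    !$OMP target enter data map(to:A) map(alloc:scratch)
    !$OMP target teams loop collapse(2)
    DO j=1,NN
        DO i=1,NM
            scratch(:,:,i,j) = 0.0          ! initialize this iteration's slab
            DO k=1,SCRATCHSIZE
                scratch(k,k,i,j) = j + i
            END DO
            A(i,j) = SUM(scratch(:,:,i,j))
        END DO
    END DO
    !$OMP END target teams loop
    !$OMP target exit data map(from:A) map(delete:scratch)

    PRINT *, A(10,10)    ! expect 200, as in the original example
END PROGRAM manualPrivate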

Yes, I thought that might be the answer.

You have hit the nail on the head. The code I’m working on was previously decorated with OpenMP targeting the CPU, and threadprivate was used to make all the global scratch arrays private to each thread, but that can’t quite be replicated in the same way for GPU offload.

We’ll have to think about how we want to approach that problem.

While continuing to modify this code, I ran across another compiler crash:

MODULE GLOBALS

IMPLICIT NONE
REAL :: scratchA

CONTAINS
SUBROUTINE SCRATCH_TEST(A)
    IMPLICIT NONE
    REAL, INTENT(INOUT)::A
    !$OMP declare target

    A = 7

END SUBROUTINE SCRATCH_TEST

END MODULE GLOBALS

PROGRAM scratchTest

    USE OMP_LIB
    USE GLOBALS
    IMPLICIT NONE
    INTEGER j

    !$OMP target teams loop
    DO j=1,10000
            call SCRATCH_TEST(scratchA)
    END DO

    !$OMP END target teams loop

END PROGRAM scratchTest

Rather than complaining that scratchA needs “declare target”, the compiler segfaults.

Thanks siensinnes. It looks to be the same error as #35799, just under a different scenario, since the compiler is segfaulting in the same spot. I’ve added this example to the report.