acc_attach(pointer) does not work in Fortran OpenACC

Hi, please look at the code below. It looks like acc_attach() does not work:

 1        MODULE m_fields
  2       
  3          real*8, allocatable, dimension(:,:,:), target :: qv
  4          real*8, pointer, dimension(:,:,:) :: p_qv
  5          !$acc declare create(p_qv, qv)
  6          
  7          contains
  8                  subroutine fill_qv(n)
  9                          !$acc routine seq
 10                          
 11                         INTEGER :: i,j,k
 12                         INTEGER, value  :: n
 13                                 do i=1, n
 14                                         do j=1, n
 15                                                 do k=1,n
 16                                                  p_qv(i,j,k)=i+j+k
 17                                                  !qv(i,j,k)=i+j+k
 18                                                 end do
 19                                         end do
 20                                 end do
 21 
 22                  end subroutine fill_qv
 23 
 24        END MODULE m_fields
 25 
 26 
 27        program test_pointer
 28                use m_fields
 29                use openacc
 30                
 31                integer :: i, j, k
 32                integer :: n=10
 33                !!$acc declare copyin(n)
 34                
 35                allocate(qv(n,n,n))
 36                
 37                !point p_qv to qv in the host 
 38                p_qv => qv
 39 
 40                !point p_qv to qv in the device
 41                call acc_attach(p_qv)
 42 
 43                !!$acc serial present(p_qv)
 44                !$acc kernels present(p_qv)
 45 
 46                call fill_qv(n)
 47 
 48                !$acc end kernels
 49                !!$acc end serial 
 50 
 51                !$acc update host(qv)
 52 
 53                print*, qv(n,n,n)
 54 
 55                DEALLOCATE ( qv )
 56 
 57        end program

Compile and run:

TP# nvfortran  -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,managed,implicitsections -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_7 test_pointer_7.f90
fill_qv:
      8, Generating acc routine seq
         Generating NVIDIA GPU code
test_pointer:
     44, Accelerator serial kernel generated
         Generating NVIDIA GPU code
     51, Generating update self(qv(:,:,:))
(rapids) TP# ./*7
libcupti.so not found
call to cuEventSynchronize returned error 700: Illegal address during kernel execution

Accelerator Kernel Timing data
(unknown)
  (unknown)  NVIDIA  devicenum=0
    time(us): 89
    0: upload reached 3 times
        0: data copyin transfers: 3
             device time(us): total=89 max=42 min=19 avg=29
TP/test_pointer_7.f90
  test_pointer  NVIDIA  devicenum=0
    time(us): 0
    44: compute region reached 1 time
        44: kernel launched 1 time
            grid: [1]  block: [1]
             device time(us): total=0 max=0 min=0 avg=0

However, it works if I use the allocatable array variable directly in fill_qv. This is a small example; in my larger code, I cannot avoid using pointers.

Thanks.

The error occurs because you’re compiling with “-gpu=managed”. While I’m not sure whether this is expected to work, the issue appears to arise when you use a pointer module variable with declare create and then try to attach it to a unified memory address as opposed to a device address. I’d normally ask engineering, but NVIDIA gave everyone the day off today and tomorrow, so no one’s around to ask. I’m on vacation as well until next year.

The simplest workaround is to remove “-gpu=managed” as well as “-stdpar”, which implicitly enables “managed”.

% nvfortran -g -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,implicitsections -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -acc test1.F90 ; a.out
fill_qv:
      8, Generating acc routine seq
         Generating NVIDIA GPU code
test_pointer:
     40, Generating enter data attach(p_qv)
     42, Accelerator serial kernel generated
         Generating NVIDIA GPU code
     45, Generating update self(qv(:,:,:))
    30.00000000000000

If you later add Fortran STDPAR (i.e. DO CONCURRENT) as well, add the “nomanaged” flag so that -stdpar doesn’t enable managed.

% nvfortran -g -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,implicitsections,nomanaged -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -acc test1.F90 ; a.out
fill_qv:
      8, Generating acc routine seq
         Generating NVIDIA GPU code
test_pointer:
     40, Generating enter data attach(p_qv)
     42, Accelerator serial kernel generated
         Generating NVIDIA GPU code
     45, Generating update self(qv(:,:,:))
    30.00000000000000

If you really do need managed, then reorganize the code so the declare create isn’t needed, by moving the kernels region into “fill_qv” rather than having it as a device routine. In this case, you can also remove the data directives and the attach.

% cat test2.F90
MODULE m_fields

  real*8, allocatable, dimension(:,:,:), target :: qv
  real*8, pointer, dimension(:,:,:) :: p_qv

  contains
          subroutine fill_qv(n)
                 INTEGER :: i,j,k
                 INTEGER, value  :: n
!$acc kernels loop collapse(3) present(p_qv)
                         do i=1, n
                                 do j=1, n
                                         do k=1,n
                                          p_qv(i,j,k)=i+j+k
                                          !qv(i,j,k)=i+j+k
                                         end do
                                 end do
                         end do

          end subroutine fill_qv

END MODULE m_fields


program test_pointer
        use m_fields
        use openacc

        integer :: i, j, k
        integer :: n=10

        allocate(qv(n,n,n))

        !point p_qv to qv in the host
        p_qv => qv
        call fill_qv(n)
        print*, qv(n,n,n)

        DEALLOCATE ( qv )

end program
% nvfortran -g -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,implicitsections,managed -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -acc test2.F90 ; a.out
fill_qv:
     10, Generating present(p_qv(:,:,:))
     11, Loop is parallelizable
     12, Loop is parallelizable
     13, Loop is parallelizable
         Generating NVIDIA GPU code
         11, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
         12,   ! blockidx%x threadidx%x collapsed
         13,   ! blockidx%x threadidx%x collapsed
    30.00000000000000

Thank you Mat.

  1. In my case, I have to use both managed and acc routine. My workaround is to pass the pointer to the subroutine and comment out the acc declare create in the module. Whether acc_attach() is kept or not makes no difference; see below:
     1  MODULE m_fields
     2
     3     real*8, allocatable, dimension(:,:,:), target :: qv
     4     real*8, pointer, dimension(:,:,:) :: p_qv
     5     !!$acc declare create(p_qv, qv)
     6
     7     contains
     8        subroutine fill_qv(n,p_qv)
     9           !$acc routine seq
    10           real*8, pointer, dimension(:,:,:) :: p_qv
    11           INTEGER :: i,j,k
    12           INTEGER, value :: n
    13           do i=1, n
    14              do j=1, n
    15                 do k=1,n
    16                    p_qv(i,j,k)=i+j+k
    17                    !qv(i,j,k)=i+j+k
    18                 end do
    19              end do
    20           end do
    21
    22        end subroutine fill_qv
    23
    24  END MODULE m_fields
    25
    26
    27  program test_pointer
    28     use m_fields
    29     use openacc
    30
    31     integer :: i, j, k
    32     integer :: n=10
    33     !!$acc declare copyin(n)
    34
    35     allocate(qv(n,n,n))
    36
    37     !point p_qv to qv in the host
    38     p_qv => qv
    39
    40     !point p_qv to qv in the device
    41     !call acc_attach(p_qv)
    42
    43     !!$acc serial present(p_qv)
    44     !$acc kernels present(p_qv)
    45
    46     !call fill_qv(n)
    47     call fill_qv(n,p_qv)
    48
    49     !$acc end kernels
    50     !!$acc end serial
    51
    52     !$acc update host(qv)
    53
    54     print*, qv(n,n,n)
    55
    56     DEALLOCATE ( qv )
    57
    58
    59  end program

TP# nvfortran -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,managed,implicitsections -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_8 test_pointer_8.f90
fill_qv:
8, Generating acc routine seq
Generating NVIDIA GPU code
test_pointer:
44, Generating present(p_qv(:,:,:))
Accelerator serial kernel generated
Generating NVIDIA GPU code
52, Generating update self(qv(:,:,:))

TP# ./*8
libcupti.so not found
30.00000000000000

Accelerator Kernel Timing data
/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_8.f90
test_pointer NVIDIA devicenum=0
time(us): 41
44: compute region reached 1 time
44: kernel launched 1 time
grid: [1] block: [1]
elapsed time(us): total=2,817 max=2,817 min=2,817 avg=2,817
44: data region reached 4 times
44: data copyin transfers: 1
device time(us): total=26 max=26 min=26 avg=26
49: data copyout transfers: 1
device time(us): total=15 max=15 min=15 avg=15
52: update directive reached 1 time

  2. Now I am a little confused about -gpu=managed: when should we use it, what are its effects on acc routines and on module allocatable variables (including those that are members of a derived type), what are the side effects of not using it, and what are the alternatives? For example, if we do not use managed memory, should we always explicitly place the variable on the device with an acc (enter) data clause? It looks like we do. In the above code we use present(p_qv), which is fine with -gpu=managed. Without managed memory, we get a run-time error (the compilation succeeds):
    FATAL ERROR: data in PRESENT clause was not found on device 1: name=p_qv(:,:,:) host:0x12557b0
    The same applies to acc data copy(p_qv). With managed memory, we can omit copy(p_qv) and use only acc data; the compiler will implicitly copy the variable p_qv to the device:
    TP# nvfortran -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,managed,implicitsections -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_8 test_pointer_8.f90
    fill_qv:
    8, Generating acc routine seq
    Generating NVIDIA GPU code
    test_pointer:
    47, Accelerator serial kernel generated
    Generating NVIDIA GPU code
    Generating implicit copy(p_qv(:,:,:)) [if not already present]

and the result is correct:
TP# ./*8
libcupti.so not found
30.00000000000000

Accelerator Kernel Timing data
/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_8.f90
test_pointer NVIDIA devicenum=0
time(us): 35
47: compute region reached 1 time
47: kernel launched 1 time
grid: [1] block: [1]
elapsed time(us): total=2,652 max=2,652 min=2,652 avg=2,652
47: data region reached 2 times
47: data copyin transfers: 1
device time(us): total=18 max=18 min=18 avg=18
52: data copyout transfers: 1
device time(us): total=17 max=17 min=17 avg=17

However, if I do not use managed memory, the compilation output is the same:
TP# nvfortran -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,implicitsections -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_8 test_pointer_8.f90
fill_qv:
8, Generating acc routine seq
Generating NVIDIA GPU code
test_pointer:
47, Accelerator serial kernel generated
Generating NVIDIA GPU code
Generating implicit copy(p_qv(:,:,:)) [if not already present]

but the run fails:
/TP# ./*8
libcupti.so not found
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.5, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x454fd0 device:0x7f50f0cfa000 size:224 presentcount:1+0 line:47 name:p_qv$sd(:)
host:0x653600 device:0x7f50f0cfa200 size:8 presentcount:1+0 line:47 name:p_qv
allocated block device:0x7f50f0cfa000 size:512 thread:1
allocated block device:0x7f50f0cfa200 size:512 thread:1

Present table errors:
p_qv(:,:,:) lives at 0x653600 size 8000 partially present in
host:0x653600 device:0x7f50f0cfa200 size:8 presentcount:1+0 line:47 name:p_qv file:/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_8.f90
FATAL ERROR: variable in data clause is partially present on the device: name=p_qv(:,:,:)
file:/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_8.f90 test_pointer line:47

Accelerator Kernel Timing data
/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_8.f90
test_pointer NVIDIA devicenum=0
time(us): 24
47: data region reached 1 time
47: data copyin transfers: 2
device time(us): total=24 max=18 min=6 avg=12

  3. My feeling is that there are some tricky aspects to acc declare create; many of the errors seem to come from it.

Thank you so much Mat, and have a great holiday!

Sincerely,

Honggang Wang.