acc_attach(pointer) does not work in Fortran OpenACC

honggangwang1979 · December 21, 2023, 4:09pm

Hi, in the code below it looks like acc_attach() does not work:

 1        MODULE m_fields
  2       
  3          real*8, allocatable, dimension(:,:,:), target :: qv
  4          real*8, pointer, dimension(:,:,:) :: p_qv
  5          !$acc declare create(p_qv, qv)
  6          
  7          contains
  8                  subroutine fill_qv(n)
  9                          !$acc routine seq
 10                          
 11                         INTEGER :: i,j,k
 12                         INTEGER, value  :: n
 13                                 do i=1, n
 14                                         do j=1, n
 15                                                 do k=1,n
 16                                                  p_qv(i,j,k)=i+j+k
 17                                                  !qv(i,j,k)=i+j+k
 18                                                 end do
 19                                         end do
 20                                 end do
 21 
 22                  end subroutine fill_qv
 23 
 24        END MODULE m_fields
 25 
 26 
 27        program test_pointer
 28                use m_fields
 29                use openacc
 30                
 31                integer :: i, j, k
 32                integer :: n=10
 33                !!$acc declare copyin(n)
 34                
 35                allocate(qv(n,n,n))
 36                
 37                !point p_qv to qv in the host 
 38                p_qv => qv
 39 
 40                !point p_qv to qv in the device
 41                call acc_attach(p_qv)
 42 
 43                !!$acc serial present(p_qv)
 44                !$acc kernels present(p_qv)
 45 
 46                call fill_qv(n)
 47 
 48                !$acc end kernels
 49                !!$acc end serial 
 50 
 51                !$acc update host(qv)
 52 
 53                print*, qv(n,n,n)
 54 
 55                DEALLOCATE ( qv )
 56 
 57        end program

compile and run:

TP# nvfortran  -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,managed,implicitsections -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_7 test_pointer_7.f90
fill_qv:
      8, Generating acc routine seq
         Generating NVIDIA GPU code
test_pointer:
     44, Accelerator serial kernel generated
         Generating NVIDIA GPU code
     51, Generating update self(qv(:,:,:))
(rapids) TP# ./*7
libcupti.so not found
call to cuEventSynchronize returned error 700: Illegal address during kernel execution

Accelerator Kernel Timing data
(unknown)
  (unknown)  NVIDIA  devicenum=0
    time(us): 89
    0: upload reached 3 times
        0: data copyin transfers: 3
             device time(us): total=89 max=42 min=19 avg=29
TP/test_pointer_7.f90
  test_pointer  NVIDIA  devicenum=0
    time(us): 0
    44: compute region reached 1 time
        44: kernel launched 1 time
            grid: [1]  block: [1]
             device time(us): total=0 max=0 min=0 avg=0

However, it works if I use the allocatable array qv directly in fill_qv. This is a small example; in my larger code I cannot avoid using pointers.

Thanks.

MatColgrove · December 21, 2023, 5:45pm

The error occurs because you’re compiling with “-gpu=managed”. While I’m not sure whether this is expected to work, the issue appears to arise when a pointer module variable is used with declare create and is then attached to a unified memory address rather than a device address. I’d normally ask engineering, but NVIDIA gave everyone the day off today and tomorrow, so no one’s around to ask. I’m on vacation as well until next year.

The simplest workaround is to remove “-gpu=managed” as well as “-stdpar”, which implicitly enables “managed”.

% nvfortran -g -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,implicitsections -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -acc test1.F90 ; a.out
fill_qv:
      8, Generating acc routine seq
         Generating NVIDIA GPU code
test_pointer:
     40, Generating enter data attach(p_qv)
     42, Accelerator serial kernel generated
         Generating NVIDIA GPU code
     45, Generating update self(qv(:,:,:))
    30.00000000000000

If you later add Fortran STDPAR (i.e. DO CONCURRENT), add the “nomanaged” sub-option so that -stdpar doesn’t enable managed.

% nvfortran -g -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,implicitsections,nomanaged -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -acc test1.F90 ; a.out
fill_qv:
      8, Generating acc routine seq
         Generating NVIDIA GPU code
test_pointer:
     40, Generating enter data attach(p_qv)
     42, Accelerator serial kernel generated
         Generating NVIDIA GPU code
     45, Generating update self(qv(:,:,:))
    30.00000000000000

If you really do need managed, then re-organize the code so the declare create isn’t needed by moving the kernels region into “fill_qv” rather than having it as a device routine. In this case, you can also remove the data directives as well as the attach.

% cat test2.F90
MODULE m_fields

  real*8, allocatable, dimension(:,:,:), target :: qv
  real*8, pointer, dimension(:,:,:) :: p_qv

  contains
          subroutine fill_qv(n)
                 INTEGER :: i,j,k
                 INTEGER, value  :: n
!$acc kernels loop collapse(3) present(p_qv)
                         do i=1, n
                                 do j=1, n
                                         do k=1,n
                                          p_qv(i,j,k)=i+j+k
                                          !qv(i,j,k)=i+j+k
                                         end do
                                 end do
                         end do

          end subroutine fill_qv

END MODULE m_fields


program test_pointer
        use m_fields
        use openacc

        integer :: i, j, k
        integer :: n=10

        allocate(qv(n,n,n))

        !point p_qv to qv in the host
        p_qv => qv
        call fill_qv(n)
        print*, qv(n,n,n)

        DEALLOCATE ( qv )

end program

% nvfortran -g -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,implicitsections,managed -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -acc test2.F90 ; a.out
fill_qv:
     10, Generating present(p_qv(:,:,:))
     11, Loop is parallelizable
     12, Loop is parallelizable
     13, Loop is parallelizable
         Generating NVIDIA GPU code
         11, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
         12,   ! blockidx%x threadidx%x collapsed
         13,   ! blockidx%x threadidx%x collapsed
    30.00000000000000

honggangwang1979 · December 21, 2023, 7:33pm

Thank you Mat.

In my case, I have to use both managed and acc routine. My workaround is to pass the pointer to the subroutine as an argument and comment out the acc declare create in the module. Whether or not the acc_attach() call is kept does not matter; see below:
 1 MODULE m_fields
 2
 3   real*8, allocatable, dimension(:,:,:), target :: qv
 4   real*8, pointer, dimension(:,:,:) :: p_qv
 5   !!$acc declare create(p_qv, qv)
 6
 7   contains
 8     subroutine fill_qv(n,p_qv)
 9       !$acc routine seq
10       real*8, pointer, dimension(:,:,:) :: p_qv
11       INTEGER :: i,j,k
12       INTEGER, value :: n
13       do i=1, n
14         do j=1, n
15           do k=1,n
16             p_qv(i,j,k)=i+j+k
17             !qv(i,j,k)=i+j+k
18           end do
19         end do
20       end do
21
22     end subroutine fill_qv
23
24 END MODULE m_fields
25
26
27 program test_pointer
28   use m_fields
29   use openacc
30
31   integer :: i, j, k
32   integer :: n=10
33   !!$acc declare copyin(n)
34
35   allocate(qv(n,n,n))
36
37   !point p_qv to qv in the host
38   p_qv => qv
39
40   !point p_qv to qv in the device
41   !call acc_attach(p_qv)
42
43   !!$acc serial present(p_qv)
44   !$acc kernels present(p_qv)
45
46   !call fill_qv(n)
47   call fill_qv(n,p_qv)
48
49   !$acc end kernels
50   !!$acc end serial
51
52   !$acc update host(qv)
53
54   print*, qv(n,n,n)
55
56   DEALLOCATE ( qv )
57
58
59 end program

TP# nvfortran -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,managed,implicitsections -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_8 test_pointer_8.f90
fill_qv:
8, Generating acc routine seq
Generating NVIDIA GPU code
test_pointer:
44, Generating present(p_qv(:,:,:))
Accelerator serial kernel generated
Generating NVIDIA GPU code
52, Generating update self(qv(:,:,:))

TP# ./*8
libcupti.so not found
30.00000000000000

Accelerator Kernel Timing data
/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_8.f90
test_pointer NVIDIA devicenum=0
time(us): 41
44: compute region reached 1 time
44: kernel launched 1 time
grid: [1] block: [1]
elapsed time(us): total=2,817 max=2,817 min=2,817 avg=2,817
44: data region reached 4 times
44: data copyin transfers: 1
device time(us): total=26 max=26 min=26 avg=26
49: data copyout transfers: 1
device time(us): total=15 max=15 min=15 avg=15
52: update directive reached 1 time

Now I am a little confused about -gpu=managed. When should we use it and when not? What effect does it have on acc routines and on allocatable module variables (including allocatable members of a derived type)? What are the side effects of not using it, and what are the alternatives? For example, if we do not use managed memory, should we always explicitly place the variable on the device with an acc (enter) data clause? It looks like we need to. In the code above we use present(p_qv), which is fine with -gpu=managed. Without managed memory, the code compiles but we get a run-time error:
FATAL ERROR: data in PRESENT clause was not found on device 1: name=p_qv(:,:,:) host:0x12557b0
The same applies to acc data copy(p_qv). With managed memory, we can omit copy(p_qv) and use only acc data; the compiler will implicitly copy p_qv to the device:
TP# nvfortran -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,managed,implicitsections -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_8 test_pointer_8.f90
fill_qv:
8, Generating acc routine seq
Generating NVIDIA GPU code
test_pointer:
47, Accelerator serial kernel generated
Generating NVIDIA GPU code
Generating implicit copy(p_qv(:,:,:)) [if not already present]

and the result is correct:
TP# ./*8
libcupti.so not found
30.00000000000000

Accelerator Kernel Timing data
/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_8.f90
test_pointer NVIDIA devicenum=0
time(us): 35
47: compute region reached 1 time
47: kernel launched 1 time
grid: [1] block: [1]
elapsed time(us): total=2,652 max=2,652 min=2,652 avg=2,652
47: data region reached 2 times
47: data copyin transfers: 1
device time(us): total=18 max=18 min=18 avg=18
52: data copyout transfers: 1
device time(us): total=17 max=17 min=17 avg=17

However, if I do not use managed memory, the compiler output is the same:
TP# nvfortran -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,implicitsections -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_8 test_pointer_8.f90
fill_qv:
8, Generating acc routine seq
Generating NVIDIA GPU code
test_pointer:
47, Accelerator serial kernel generated
Generating NVIDIA GPU code
Generating implicit copy(p_qv(:,:,:)) [if not already present]

but the run fails:
/TP# ./*8
libcupti.so not found
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.5, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x454fd0 device:0x7f50f0cfa000 size:224 presentcount:1+0 line:47 name:p_qv$sd(:)
host:0x653600 device:0x7f50f0cfa200 size:8 presentcount:1+0 line:47 name:p_qv
allocated block device:0x7f50f0cfa000 size:512 thread:1
allocated block device:0x7f50f0cfa200 size:512 thread:1

Present table errors:
p_qv(:,:,:) lives at 0x653600 size 8000 partially present in
host:0x653600 device:0x7f50f0cfa200 size:8 presentcount:1+0 line:47 name:p_qv file:/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_8.f90
FATAL ERROR: variable in data clause is partially present on the device: name=p_qv(:,:,:)
file:/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_8.f90 test_pointer line:47

Accelerator Kernel Timing data
/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_8.f90
test_pointer NVIDIA devicenum=0
time(us): 24
47: data region reached 1 time
47: data copyin transfers: 2
device time(us): total=24 max=18 min=6 avg=12

My feeling is that there are some tricky aspects to acc declare create. Many of the errors I see come from it.

Thank you so much Mat, and have a great holiday!

Sincerely,

Honggang Wang.
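
A minimal sketch for the question above about running without -gpu=managed, assuming the pointer and its target are local to the program rather than module variables under declare create. The device copy of qv is created explicitly with enter data, and present(p_qv) is then expected to succeed because p_qv is associated with the same host memory as qv. The program name explicit_data and the loop layout are illustrative and not taken from the thread; treat the behaviour under nvfortran as an expectation rather than a guarantee.

```fortran
program explicit_data
  implicit none
  integer, parameter :: n = 10
  real(8), allocatable, target :: qv(:,:,:)
  real(8), pointer :: p_qv(:,:,:)
  integer :: i, j, k

  allocate(qv(n,n,n))

  ! Associate the pointer with the array on the host.
  p_qv => qv

  ! Assumption: no managed memory, so the device copy is created explicitly.
  !$acc enter data create(qv)

  ! p_qv aliases qv on the host, so the present lookup should find
  ! the device copy created above (assumed, not confirmed in the thread).
  !$acc parallel loop collapse(3) present(p_qv)
  do k = 1, n
    do j = 1, n
      do i = 1, n
        p_qv(i,j,k) = i + j + k
      end do
    end do
  end do

  ! Bring the results back to the host before printing.
  !$acc update host(qv)
  print *, qv(n,n,n)

  ! Release the device copy before freeing the host array.
  !$acc exit data delete(qv)
  deallocate(qv)
end program explicit_data
```

Built with -acc -gpu=ccall and no managed sub-option, this is expected to print 30.0, matching the runs above. Compiled without -acc, the directives are treated as comments and the loop runs on the host.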
