In OpenACC Fortran, 1. how to use private pointer variables, 2. How to deal with derived type variables with allocable variables

Hello, in the following code, I use a derived type with allocable members. If not declare created it, there are 3 methods to work out the correct answer:
Method 1: use unstructured data clause, acc enter data and acc exit data.
Method 2: use structural data clause, acc data and acc end data
Method 3: use nothing. The compiler will automatically implicitly copy in what we need.
My first question is : what method is the best and why?

My second question is: How to use the private pointer variables in the kernel? as shown in my code, whether I use private(p_t, p_qv, p_qc) or not, the results are same. Also it looks like the competitive condition which we expect without private(p_t, p_qv, p_qc) has little effects on the performance. In both case, with or without, the compiler will not show that it has realized that we need private pointer variables in the kernel. So the compiler just silently creates private pointers in the kernel and does not let us know this?

My third question or observation is that, the 700 error problem will come when we declare create a derived type variable but do not explicitly copy its allocable members into the device.

Here is my code:

MODULE m_fields

    type pair

       REAL*8, ALLOCATABLE :: qv(:,:,:)
       REAL*8, ALLOCATABLE :: t(:,:,:)
     end type

     type (pair), target  :: pair_t_qv

    !!$acc declare create(pair_t_qv)

   END MODULE m_fields

   program test_pointer
           use m_fields

           real*8, allocatable,target  :: qc(:,:,:)
           !!$acc declare create(qc)
           real*8, pointer :: p_t => null()
           real*8, pointer :: p_qc => null(), p_qv => null()
           type (pair), pointer :: p_pair  => null()
           integer :: i, j, k
           integer :: n=100

           p_pair => pair_t_qv
           ALLOCATE ( p_pair%t(n,n,n) )
           ALLOCATE ( p_pair%qv(n,n,n) )

           ALLOCATE ( qc(n,n,n) )

           !!$acc data 

!Method 1: unstructural data region
!!$acc enter data copyin(p_pair)
!!$acc enter data copyin(p_pair%t, p_pair%qv)

           !Method 2: structural data region, each of the next line will work
           !!$acc data 
           !!$acc data copyin(p_pair,qc)

           !Mehod 3: remove all the data clause, the compiler will implicitly copy what it needs

           !$acc kernels
           !each of the next two lines will work
       !!$acc loop independent collapse(3) private(p_t, p_qv, p_qc)
           !$acc loop independent collapse(3) 
           DO j = 1,n
                     DO i = 1,n
                            DO k = 1, n
                                    !p_t => pair_t_qv%t(i,j,k)
                                    p_t => p_pair%t(i,j,k)
                                    p_t = i+j+k

                                    !p_qv => pair_t_qv%qv(i,j,k)
                                    p_qv => p_pair%qv(i,j,k)
                                    p_qv = i*j*k

                                    p_qc => qc(i,j,k)
                                    p_qc = p_t+p_qv

                            END DO
                    END DO
           END DO
           !$acc end kernels

!method 2
!!$acc end data

           !method 1 
           !!$acc exit data copyout(p_pair%t, p_pair%qv)
           !!$acc exit data delete(p_pair)

           print*, pair_t_qv%t(n,n,n)
           print*, pair_t_qv%qv(n,n,n)
           print*, qc(n,n,n)

           deallocate(qc)
           deallocate(pair_t_qv%t)
           deallocate(pair_t_qv%qv)
   end program

compiled with nvfortran -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,managed,implicitsections -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_4 test_pointer_4ok.f90

Thanks.

In order to accomplish this, you’ll need to do a deep copy.

If I’m using directives, I prefer using unstructured data regions since I can then match how the allocation is done on the host as well as having the host and device copies have the same scope and lifetime.

           p_pair => pair_t_qv
           ALLOCATE ( p_pair%t(n,n,n) )
           ALLOCATE ( p_pair%qv(n,n,n) )
!$acc enter data create(p_pair, p_pair%t, p_pair%qv)

Here “p_pair” is created on the device. Then “t” and “qv” are created and given they are children of “p_pair”, the OpenACC runtime implicitly does an “attach” (i.e. fill in the pointers to the child memory in the parent)

One caveat is that updates only perform shallow-copies, so you shouldn’t use:

!$acc update self(p_pair)
!$acc update device(p_pair)

since the host/device pointers to “t” and “qv” will get overwritten with the host/device pointer’s address. Instead, only update the children.

We do have a flag “-gpu=deepcopy” which in Fortran will perform this deep copy for you. In other words, all you need to use in the directives is the parent, i.e. “!$acc enter data(p_pair)” and “!$acc update device(p_pair)”. the runtime then implicitly traverses the parent’s structure copying the children. I personally don’t use this feature since the entire structure is copied and I prefer to be selective. But it is available if you want to try.

Another option is to use the flag “-gpu=managed”. Here memory is allocated in CUDA Unified Memory (UM) and the CUDA driver takes care of the data movement for you. This is the easiest method and sometimes the only viable option if the structure is very complex.

However static memory still needs to be manually managed via directives. So if the parent is static (i.e. not allocated), then you still need to put it in a data region. However the allocated children will be managed by the driver. If you are on a Grace-Hopper system, we just added the flag “-gpu=unified” where all data can be managed by the driver, including static data, and no data regions are needed. (This blog has more info on unified).

The only issues I have with UM is that CUDA Aware MPI can’t do GPU direct with unified memory and given UM is transferred via page faults during the execution of the kernel, when I look at profiles, it’s difficult to tell the performance of the kernel since the data movement gets mixed in.

My recommendation would to start with using managed memory (-gpu=managed) and then only go to data directives if needed. As I note in the blog, the data directives become a tuning option, but aren’t a requirement.

My second question is: How to use the private pointer variables in the kernel? as shown in my code, whether I use private(p_t, p_qv, p_qc) or not, the results are same. Also it looks like the competitive condition which we expect without private(p_t, p_qv, p_qc) has little effects on the performance. In both case, with or without, the compiler will not show that it has realized that we need private pointer variables in the kernel. So the compiler just silently creates private pointers in the kernel and does not let us know this?

Can you post the compiler feedback messages (-Minfo=accel) output for this section? The compiler should tell you if it’s implicitly privatizing these variables. While contextual, my first thought is that they would be implicitly copied, but it’s analysis might be detecting that they can be implicitly privatized.

My third question or observation is that, the 700 error problem will come when we declare create a derived type variable but do not explicitly copy its allocable members into the device.

Yes, this would be expected. Only a shallow copy is getting performed on the parent, so the children will have a host address, which in turn causes the 700 error since this address is not accessible on the device. Again, you can fix this by performing a manual deep copy, using true deep copy (i.e. -gpu=deepcopy), or CUDA Unified Memory (-gpu=managed).

-Mat

Thank you Mat.

  1. I will try the deep and manage option.
  2. as to the private pointer question, here is the output of the compilation without using private((p_t, p_qv, p_qc):
    nvfortran -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,managed,implicitsections -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_4 test_pointer_4ok.f90
    test_pointer:
    56, Generating implicit copyin(p_qv,p_qc,p_t) [if not already present]
    Generating implicit copy(qc(:,:,:),p_pair) [if not already present]
    60, Loop is parallelizable
    61, Loop is parallelizable
    62, Loop is parallelizable
    Generating NVIDIA GPU code
    60, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
    61, ! blockidx%x threadidx%x collapsed
    62, ! blockidx%x threadidx%x collapsed

and here is the output with the private(p_t, p_qv, p_qc):

nvfortran -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,managed,implicitsections -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_4 test_pointer_4ok.f90
test_pointer:
56, Generating implicit copy(qc(:,:,:),p_pair) [if not already present]
60, Loop is parallelizable
61, Loop is parallelizable
62, Loop is parallelizable
Generating NVIDIA GPU code
60, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
61, ! blockidx%x threadidx%x collapsed
62, ! blockidx%x threadidx%x collapsed

I couldn’t find any “private” information.

Thanks.

As I suspected, the compiler is implicitly copying the pointers which would make them shared by the threads. This should cause a race condition and why I’m surprised it ran correctly for you. You’ll definitely want to add these to the private clause.

Due to verbosity and redundancy, variables in a private clause don’t appear in the compiler feedback.

-Mat

Thank you Mat.

I was expecting a race condition, but it looks like it didn’t happen even in a large code. I agree that it is better or safer to keep the private clause, but sometimes it will raise other compiling problems like “Scalar last value needed after loop for …” and “Loop carried dependence of … prevents parallelization”
As shown in the following code, if I do not use private clause, I got no complains from the compilation and I can parallelize both the two loops. however, If I do use the private clause, I will get complains from the compilation and lose one parallel loop (although the answer is correct in this small case, in a bigger case float point error or 700 error will happen):

 1       !In this case, we show a quite complicate case with derived type
  2       !arrays, pointers to members of derived type variables. 
  3       
  4        MODULE m_fields
  5       
  6                 TYPE CELL_TYPE
  7                  LOGICAL :: SOLID=.FALSE.                            !< Indicates if grid cell is solid or not
  8                  INTEGER :: I,J,K                                    !< Indices of cell
  9                 END TYPE CELL_TYPE
 10                 
 11                 type MESH_TYPE
 12                        TYPE(CELL_TYPE), ALLOCATABLE, DIMENSION(:) :: CELL 
 13                        INTEGER :: N_OBST=0
 14                 end type
 15          
 16          TYPE (MESH_TYPE), SAVE, DIMENSION(:), ALLOCATABLE, TARGET :: MESHES
 17          TYPE (MESH_TYPE), SAVE, DIMENSION(:), ALLOCATABLE, TARGET :: MESHES_2
 18 
 19        END MODULE m_fields
 20 
 21 
 22        program test_pointer
 23                use m_fields
 24                
 25                integer :: i, j, k
 26                integer :: n=100
 27                
 28                TYPE (MESH_TYPE),  pointer :: M=>null()
 29                TYPE (MESH_TYPE),  pointer :: M_2=>null() 
 30                TYPE(CELL_TYPE), pointer, DIMENSION(:) :: CELL=>null() 
 31                TYPE(CELL_TYPE), pointer, DIMENSION(:) :: CELL_2=>null()
 32                TYPE(CELL_TYPE), pointer :: one_CELL=>null()
 33 
 34                ALLOCATE ( MESHES(n) ) 
 35                ALLOCATE ( MESHES_2(n) )
 36                do i=1, n
 37                   ALLOCATE(meshes(i)%CELL(n))
 38                   ALLOCATE(meshes_2(i)%CELL(n))
 39                enddo
 40 
 41                M=>MESHES(n)
 42                M_2=>MESHES_2(n)
 43 
 44                CELL=>M%CELL
 45                CELL_2=>M_2%CELL
 46 
 47 
 48                !!$acc data 
 49                !$acc enter data copyin(CELL, CELL_2)
 50 
 51                !$acc kernels
 52 
 53                !!$acc loop independent collapse(1) 
 54                !$acc loop independent collapse(1) private(one_CELL)
 55                do i=1, n
 56                   one_CELL => CELL(i)
 57                   one_CELL%SOLID=.TRUE.
 58                   one_CELL%I=i
 59                   one_CELL%J=2*i
 60                   one_CELL%K=3*i
 61                enddo
 62 
 63                !!$acc loop independent collapse(1) 
 64                !$acc loop independent collapse(1) private(one_CELL)
 65                do j=1, n
 66                   one_CELL => CELL_2(j)
 67                   one_CELL%SOLID=.TRUE.
 68                   one_CELL%I=j
 69                   one_CELL%J=j
 70                   one_CELL%K=j
 71                enddo
 72 
 73                !$acc end kernels
 74 
 75                !$acc exit data copyout(CELL, CELL_2)
 76                !$acc exit data delete(CELL, CELL_2)
 77 
 78                !!$acc end data 
 79 
 80                print*, CELL(n)%SOLID,  CELL(n)%I,CELL(n)%J, CELL(n)%K
 81                print*, CELL_2(n)%SOLID,  CELL_2(n)%I,CELL_2(n)%J, CELL_2(n)%K
 82 
 83                do i=1, n
 84                   DEALLOCATE(meshes(i)%CELL)
 85                   DEALLOCATE(meshes_2(i)%CELL)
 86                enddo
 87                DEALLOCATE ( MESHES, MESHES_2)
 88 
 89        end program

nvfortran -g -pg -Mlarge_arrays -m64 -Wall -Werror -gpu=ccall,managed,implicitsections -stdpar -traceback -ffpe-trap=invalid,zero,overflow -Minfo=accel -cpp -acc -o test_pointer_6 test_pointer_6.f90
test_pointer:
58, Generating enter data copyin(cell_2(:),cell(:))
60, Generating implicit copy(cell_2(:),cell(:)) [if not already present]
64, Scalar last value needed after loop for one_cell at line 73
Accelerator serial kernel generated
CUDA shared memory used for one_cell
Generating NVIDIA GPU code
64, !$acc loop seq
64, Scalar last value needed after loop for one_cell at line 73
Complex loop carried dependence of one_cell%solid,one_cell%j,one_cell%i prevents parallelization
Loop carried dependence of one_cell%i prevents parallelization
Loop carried backward dependence of one_cell%i,one_cell%k prevents vectorization
Loop carried dependence of one_cell%j prevents parallelization
Loop carried backward dependence of one_cell%j prevents vectorization
Loop carried dependence of one_cell%k prevents parallelization
Loop carried backward dependence of one_cell%k prevents vectorization
Loop carried dependence of one_cell%k prevents parallelization
74, Loop is parallelizable
Generating NVIDIA GPU code
74, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
74, Local memory used for one_cell
84, Generating exit data copyout(cell(:),cell_2(:))
85, Generating exit data delete(cell_2(:),cell(:))

TP# ./*6
libcupti.so not found
T 100 200 300
T 100 100 100

Accelerator Kernel Timing data
/notebooks/ParallelProgrammingWithOpenACC/Chapter13/example_openacc11/TP/test_pointer_6.f90
test_pointer NVIDIA devicenum=0
time(us): 0
58: data region reached 1 time
60: compute region reached 1 time
64: kernel launched 1 time
grid: [1] block: [1]
elapsed time(us): total=774 max=774 min=774 avg=774
74: kernel launched 1 time
grid: [1] block: [128]
elapsed time(us): total=21 max=21 min=21 avg=21
60: data region reached 2 times
84: data region reached 1 time
85: data region reached 1 time

Thank you very much!

Looks like the compiler is detecting a dependency between the loops where it thinks the one_CELL’s value from the first loop is needed in the second.

The work arounds are to use a different variable in the second loop, such as “one_CELL2”, or nullify the variable between the loops to break the dependency. Something like:

!$acc enter data copyin(CELL, CELL_2)
!$acc kernels loop independent private(one_CELL)
do i=1, n
 one_CELL => CELL(i)
 one_CELL%SOLID=.TRUE.
 one_CELL%I=i
 one_CELL%J=2*i
 one_CELL%K=3*i
enddo
one_CELL=>null()
!$acc kernels loop independent private(one_CELL)
do j=1, n
 one_CELL => CELL_2(j)
 one_CELL%SOLID=.TRUE.
 one_CELL%I=j
 one_CELL%J=j
 one_CELL%K=j
enddo
!$acc exit data copyout(CELL, CELL_2)
% nvfortran -acc -Minfo=accel test.F90
test_pointer:
     48, Generating enter data copyin(cell_2(:),cell(:))
     49, Generating implicit copy(cell(:)) [if not already present]
     50, Loop is parallelizable
         Generating NVIDIA GPU code
         50, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     50, Local memory used for one_cell
     58, Generating implicit copy(cell_2(:)) [if not already present]
     59, Loop is parallelizable
         Generating NVIDIA GPU code
         59, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     59, Local memory used for one_cell
     66, Generating exit data copyout(cell_2(:),cell(:))
% a.out
  T          100          200          300
  T          100          100          100