OpenACC Derived-Data Type From Host to Device, DEEP COPY

Hello,

I have a module that declares types as below:

MODULE ABC

IMPLICIT NONE

TYPE B
INTEGER, ALLOCATABLE, DIMENSION(:,:) ::  USA
END TYPE B

TYPE A
INTEGER, ALLOCATABLE, DIMENSION(:,:) :: BRI
TYPE (B), ALLOCATABLE, DIMENSION(:) :: GER
END TYPE A

TYPE (A), ALLOCATABLE, DIMENSION(:) :: TUR

END MODULE MOD_ABC

Then I use this module in a main program and call the derived-data types from device as follows:

  !$acc enter data copyin(TUR(1))
  !$acc data copyin(FUN(:,:,:),ELFUN(:,:,:,:)) create(NO,I0) copy(PPV(:,:,:)) 
  !$acc parallel loop default(present)
  DO NUC = 1,Z
    DO EL = K,L
	DO J = 1T
	  NO = TUR(1)%GER(J_P)%USA(EL,J)
	  I0 = TUR(1)%BRI(NO)
      ! extra calculations
	END DO !J
    END DO ! EL
  END DO
  !$acc end parallel loop
  !$acc end data

I included “deepcopy” flag on the Makefile.
When I run this program, it gives the error that

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

I looked at the manuals of NVIDIA. Yet I could not understand how to manage derived-data types.
I could not find a whole sample code that shows such an example. I am also open to online recommendations for sample codes.

Thanks for your attention!

Hi yunus.altintop.2,

It’s difficult to say what’s wrong given the incomplete example and Fortran syntax errors (module name mismatch, “1T” should be “1,T”). Are you able to post a small reproducing example?

Also, is it your intent to make NO and I0 shared? Seems like these should be private else you’ll get collisions.

I rewrote and simplifies your example below and deep copy is working as expected. Hence, I suspect something else is causing your illegal address error. Exactly what is unclear, but hopefully you can modify what I have below to show the error.

% cat test.f90
MODULE MOD_ABC

IMPLICIT NONE

TYPE B
INTEGER, ALLOCATABLE, DIMENSION(:,:) ::  USA
END TYPE B

TYPE A
INTEGER, ALLOCATABLE, DIMENSION(:,:) :: BRI
TYPE (B), ALLOCATABLE, DIMENSION(:) :: GER
END TYPE A

TYPE (A), ALLOCATABLE, DIMENSION(:) :: TUR

END MODULE MOD_ABC

program foo
   use MOD_ABC
   integer ::  J,K,L,Z,NUC,EL,T

   K=1
   L=32
   T=32
   allocate(TUR(1))
   allocate(TUR(1)%GER(K:L))
   DO EL = K,L
     allocate(TUR(1)%GER(EL)%USA(K:J,1:T))
     DO J = 1,T
       TUR(1)%GER(EL)%USA(EL,J)=1
     enddo
   enddo
  NO=1
  !$acc enter data copyin(TUR(:1))
  !$acc parallel loop default(present) reduction(+:NO)
  DO EL = K,L
     DO J = 1,T
          NO = NO + TUR(1)%GER(EL)%USA(EL,J)
     END DO !J
  END DO
  !$acc end parallel loop
  !$acc exit data delete(TUR)
  print *,NO

end program foo
% nvfortran test.f90 -acc -gpu=deepcopy -Minfo=accel; a.out
foo:
     34, Generating enter data copyin(tur(:1))
     35, Generating Tesla code
         36, !$acc loop gang ! blockidx%x
             Generating reduction(+:no)
         37, !$acc loop vector(32) ! threadidx%x
             Interchanging generated vector loop outwards
             Interchanging generated strip mine loop outwards
     35, Generating default present(tur(:))
         Generating implicit copy(no) [if not already present]
     37, Loop is parallelizable
     42, Generating exit data delete(tur(:))
         1025

-Mat

1 Like

It is very clean explanation and example.
When I add the exit data delete(TUR) command and also adjust deepcopy flag in the Makefile as you done, it worked.
Yet I learned explicit transfer of data from the YouTube videos of Michael Wolfe. Do you think the implicit deep copy is better than the explicit, is it faster?

Thanks a lot!

I personally typically use explicit deep copy, but that’s more a matter of style and what I’m used to doing. Implicit deep copy is much simpler to use and requires less code. Performance is about the same.

The one caveat with implicit deep copy is that the entire type and sub-types will be copied. So if only a portion of the type is actually used on the device, the you will end up using extra memory on the device and take more time copying unneeded data.

1 Like