Device-to-device transfers involving components of derived types

Dear all,

I am fairly new to the world of CUDA Fortran, and I’ve stumbled upon something I can’t quite wrap my head around. I am working strictly with compute capability 6.0.

The situation: I have a host-side derived type containing an allocatable device array. However, after allocation, some operations on such “nested” arrays only seem to work when they are wrapped in procedure calls. I tried to construct a minimal example of this issue (it’s on pastebin for formatting, but I can also repost it here if need be):

Derived-type module
Driver program

The driver program copies a host array to both an “explicit” device array and a component of the derived type (also supposedly a device array), then copies both device arrays back to a host array, so both the plain and the “nested” device array are exercised in host-to-device and device-to-host transfers. This works just fine, and the writes “b” and “c” give the correct output.

Things start to break down when the program attempts a device-to-device copy involving a “nested” reference to the device array within the derived type (d_hst%arr = arr_d; the compiler reports “More than one device-resident object in assignment”). Strangely enough, it does work if the copy is done within a subroutine call instead (here: the “copy_device_array” call).

My question is: why does the device-to-device copy work within a subroutine call but not “directly” in this particular case? I fear I might be misunderstanding something on a fundamental level, but I’d be more than happy to learn what is at work here.

Thank you in advance.

Edit: Changed line references to explicit statements

Hi rng_sus,

but I can also repost it here if need be)

Please do. Unfortunately my firewall blocks pastebin so I can’t access the example.

Once I can see the code, hopefully I can offer some advice.

-Mat

Hello Mat,

here is the source code:

Derived-type module

module D_Type

  use cudafor

  implicit none

  ! Host-resident derived type containing a device array
  type dev_arr

     integer, device, allocatable :: arr(:)

  end type dev_arr

end module D_Type

Driver program

program Derived
  
  use D_type

  implicit none

  ! Parameters
  integer, parameter :: arr_dim = 4 ! Array dimension

  ! Variables
  type(dev_arr) :: d_hst            ! Derived device array
  integer, device :: arr_d(arr_dim) ! Standard device array
  integer :: arr(arr_dim)           ! Host array
  integer :: i


  ! Fill and print host array
  do, i = 1, arr_dim

     arr(i) = i

  end do

  write(*, *) "a", arr
  
  ! Allocate derived device array
  allocate(d_hst%arr(arr_dim))

  ! Host-to-device copy operations
  arr_d = arr + 1     ! Works
  d_hst%arr = arr + 2 ! Works

  arr = arr_d
  write(*, *) "b", arr

  arr = d_hst%arr
  write(*, *) "c", arr ! 1 greater than in "b"

  ! Device-to-device copy operations
  d_hst%arr = arr_d                        ! Does not work: "More than one device-resident object in assignment"
  call copy_device_array(d_hst%arr, arr_d) ! Works
  
  arr = d_hst%arr
  write(*, *) "d", arr ! same as in "b"

contains

  subroutine copy_device_array(tgt, sr)
    ! Device-to-device copy from one device array to another

    ! Arguments
    integer, device, intent(out) :: tgt(arr_dim)
    integer, device, intent(in) :: sr(arr_dim)


    tgt = sr

  end subroutine copy_device_array
  
end program Derived

Many thanks for your help in advance.

Hi rng_sus,

Not unexpected in this case, but I’ve put in an RFE (TPR #28340) to see if it’s something we could support in the future.

An easy workaround is to use a call to cudaMemcpy instead:

% cat main.CUF
module D_Type

  use cudafor

  implicit none

  ! Host-resident derived type containing a device array
  type dev_arr

     integer, device, allocatable :: arr(:)

  end type dev_arr

end module D_Type


program Derived

  use D_type
  use cudafor

  implicit none

  ! Parameters
  integer, parameter :: arr_dim = 4 ! Array dimension

  ! Variables
  type(dev_arr) :: d_hst            ! Derived device array
  integer, device :: arr_d(arr_dim) ! Standard device array
  integer :: arr(arr_dim)           ! Host array
  integer :: i, istat


  ! Fill and print host array
  do, i = 1, arr_dim

     arr(i) = i

  end do

  write(*, *) "a", arr

  ! Allocate derived device array
  allocate(d_hst%arr(arr_dim))

  ! Host-to-device copy operations
  arr_d = arr + 1     ! Works
  d_hst%arr = arr + 2 ! Works

  arr = arr_d
  write(*, *) "b", arr

  arr = d_hst%arr
  write(*, *) "c", arr ! 1 greater than in "b"

  ! Device-to-device copy operations
#ifndef USE_MEMCPY
  d_hst%arr = arr_d                        ! Does not work: "More than one device-resident object in assignment"
#else
   istat = cudaMemCpy(d_hst%arr,arr_d,arr_dim,cudaMemcpyDeviceToDevice)
#endif
!   call copy_device_array(d_hst%arr, arr_d) ! Works

  arr = d_hst%arr
  write(*, *) "d", arr ! same as in "b"

contains

  subroutine copy_device_array(tgt, sr)
    ! Device-to-device copy from one device array to another

    ! Arguments
    integer, device, intent(out) :: tgt(arr_dim)
    integer, device, intent(in) :: sr(arr_dim)


    tgt = sr

  end subroutine copy_device_array

end program Derived
% pgfortran main.CUF -DUSE_MEMCPY; a.out
 a            1            2            3            4
 b            2            3            4            5
 c            3            4            5            6
 d            2            3            4            5
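
Compiling without -DUSE_MEMCPY takes the #ifndef branch instead, so the direct assignment is compiled and the “More than one device-resident object in assignment” error comes back.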

Hope this helps,
Mat

Hello Mat,

is there a deeper reason why this “problem” is only encountered in device-to-device transfers involving references to device components of derived types? As I recall, device-to-device transfers between “ordinary” device arrays (like arr_d in the example) can be done through a simple assignment without invoking cudaMemcpy explicitly.

In the same vein, why does the host-to-device transfer still work while the device-to-device one requires an explicit call to cudaMemcpy in this case? Most importantly (to me), how does the subroutine call in the example (apparently) circumvent this issue?

I ask these things because this minimal example is just one of many situations where I’ve run into issues with references to device components of derived types in place of “ordinary” device arrays. For instance, I could not use such components in CUF kernels either unless they had first been passed through a subroutine call (like in the minimal example), as sketched below. I’d like to understand what’s going on “behind the scenes” to some degree so I can anticipate and plan around this behaviour in the future.
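
To illustrate the CUF kernel case with the driver program above (a sketch only; the doubling loop and the helper name double_on_device are made up for illustration):

  ! In the main program: using the component directly in a CUF
  ! kernel fails to compile...
  !$cuf kernel do <<<*, *>>>
  do i = 1, arr_dim
     d_hst%arr(i) = 2 * d_hst%arr(i)
  end do

  ! ...whereas routing it through a subroutine works:
  call double_on_device(d_hst%arr)

  ! Placed in the "contains" section of the driver program:
  subroutine double_on_device(a)
    ! The dummy argument is an ordinary device array
    integer, device, intent(inout) :: a(arr_dim)
    integer :: j

    !$cuf kernel do <<<*, *>>>
    do j = 1, arr_dim
       a(j) = 2 * a(j)
    end do

  end subroutine double_on_device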

is there a deeper reason why this “problem” is only encountered in device-to-device transfers involving references to device components of derived types?

No, I don’t think there’s a deep reason other than that we’re missing this case. The compiler is basically just doing a pattern match and then substituting the pattern with calls to cudaMemcpy.

Though even if we are able to handle this pattern, multi-level derived types, e.g. “T%A%B_d”, will probably still require the direct call. One level isn’t too difficult for the compiler to figure out, but each additional level increases the complexity.
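
For a hypothetical two-level case, that would look something like this (a sketch only; the types inner and outer are made up for illustration, while arr_d, arr_dim, and istat are as in the example above):

  ! Two-level derived type: the assignment "t%a%b = arr_d" would
  ! presumably still be rejected, so cudaMemcpy is called directly.
  type inner
     integer, device, allocatable :: b(:)
  end type inner

  type outer
     type(inner) :: a
  end type outer

  type(outer) :: t

  allocate(t%a%b(arr_dim))
  istat = cudaMemcpy(t%a%b, arr_d, arr_dim, cudaMemcpyDeviceToDevice)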

how does the subroutine call in the example (apparently) circumvent this issue?

Because the compiler recognizes the pattern “A_d=B_d” and is just missing the “T%A_d=B_d” case. Inside the subroutine, the derived type component arrives as a plain device dummy argument, so the assignment matches the recognized pattern.

Hope this helps,
Mat

Hello Mat,

many thanks for your help so far. It’s good to know that it’s simply an internal pattern-matching issue. However, I have one last follow-up question if you don’t mind:

How come this is apparently not an issue with host-to-device transfers involving those patterns? Those worked fine without any need for workarounds in the minimal example, after all.

How come this is apparently not an issue with host-to-device transfers involving those patterns?

They’re different patterns. Each pattern is a combination of the array attribute (host, device, managed, or pinned) on both the left- and right-hand sides, plus whether or not each array is a derived type member. Think of it as a matrix where each combination is a separate pattern that the compiler must recognize and then translate to the appropriate cudaMemcpy call.
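
For instance, with the arrays from the example above (not an exhaustive list; the annotations reflect what this thread has established):

  ! Each assignment is a separate pattern the compiler must recognize
  ! and lower to the matching cudaMemcpy direction:
  arr_d     = arr        ! host -> device                (recognized)
  d_hst%arr = arr        ! host -> device component      (recognized)
  arr       = arr_d      ! device -> host                (recognized)
  arr       = d_hst%arr  ! device component -> host      (recognized)
  d_hst%arr = arr_d      ! device -> device component    (missing; TPR #28340)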

-Mat

I see. So currently, the compiler can recognise only some of the possible permutations. That’s definitely good to know. Thanks again for your patience.

Hi all,

The fix for this is included in the NVIDIA HPC SDK release 20.5; more info about it here: https://developer.nvidia.com/hpc-sdk