I am fairly new to the world of CUDA Fortran, and I’ve stumbled upon something I can’t quite wrap my head around. I am working strictly with compute capability 6.0.
The situation: I have a host-side derived type containing an allocatable device array. However, only certain operations seem to work with those “nested” arrays after allocation unless they are wrapped in procedure calls. I tried to construct a minimal example for this issue (it’s on pastebin for formatting, but I can also repost it here if need be):
The driver program copies a host array to both an “explicit” device array and to the device-array component of the derived type, then copies both device arrays back to a host array, exercising both the host-to-device and device-to-host cases. This works just fine, and the writes “b” and “c” give the correct output.
Things start to break down when the program attempts a device-to-device copy involving a “nested” reference to the device array inside the derived type (d_hst%arr = arr_d; the compiler reports “More than one device-resident object in assignment”). Strangely enough, it does work if the copy is done within a subroutine call instead (here: the “copy_device_array” call).
My question now is: why does the device-to-device copy work within a subroutine call but not “directly” in this particular case? I fear I might be misunderstanding something on a fundamental level here, but I’d be more than happy to learn what is at work.
Thank you in advance.
Edit: Changed line references to explicit statements
module D_Type
   use cudafor
   implicit none

   ! Host-resident derived type containing a device array
   type dev_arr
      integer, device, allocatable :: arr(:)
   end type dev_arr
end module D_Type

Driver program

program Derived
   use D_Type
   implicit none

   ! Parameters
   integer, parameter :: arr_dim = 4   ! Array dimension

   ! Variables
   type(dev_arr) :: d_hst              ! Derived type holding a device array
   integer, device :: arr_d(arr_dim)   ! Standard device array
   integer :: arr(arr_dim)             ! Host array
   integer :: i

   ! Fill and print host array
   do i = 1, arr_dim
      arr(i) = i
   end do
   write(*, *) "a", arr

   ! Allocate the device array inside the derived type
   allocate(d_hst%arr(arr_dim))

   ! Host-to-device copy operations
   arr_d = arr + 1      ! Works
   d_hst%arr = arr + 2  ! Works
   arr = arr_d
   write(*, *) "b", arr
   arr = d_hst%arr
   write(*, *) "c", arr ! 1 greater than in "b"

   ! Device-to-device copy operations
   d_hst%arr = arr_d    ! Fails: "More than one device-resident object in assignment"
   call copy_device_array(d_hst%arr, arr_d) ! Works
   arr = d_hst%arr
   write(*, *) "d", arr ! Same as in "b"

contains

   subroutine copy_device_array(tgt, sr)
      ! Device-to-device copy from one device array to another
      integer, device, intent(out) :: tgt(arr_dim)
      integer, device, intent(in)  :: sr(arr_dim)
      tgt = sr
   end subroutine copy_device_array

end program Derived
Is there a deeper reason why this “problem” is only encountered in device-to-device transfers involving references to device components of derived types? As I recall, device-to-device transfers between “ordinary” device arrays (like arr_d in the example) can be done through a simple assignment, without having to invoke cudaMemcpy explicitly.
In the same vein, why does the host-to-device transfer still work, while the device-to-device one requires an explicit invocation of cudaMemcpy in this case? Most importantly (to me), how does the subroutine call in the example (apparently) circumvent this issue?
I ask these things because this minimal example is just one of many situations where I’ve run into issues with references to device components of derived types in place of “ordinary” device arrays. For instance, I could not use such components in CUF kernels either, unless they had been passed through a subroutine call (as in the minimal example) first. I’d like to understand what’s going on “behind the scenes” to some degree so I can anticipate and plan around this behaviour in the future.
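For reference, the explicit cudaMemcpy fallback I mean would look something like this (a sketch using the generic cudaMemcpy interface from the cudafor module; istat is a hypothetical status variable):

```fortran
! Sketch: replacing the rejected device-to-device assignment with an
! explicit cudaMemcpy call. With typed array arguments the generic
! interface infers the copy direction from the device attributes,
! and the count is given in elements, not bytes.
integer :: istat
istat = cudaMemcpy(d_hst%arr, arr_d, arr_dim)
if (istat /= cudaSuccess) write(*, *) cudaGetErrorString(istat)
```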
Is there a deeper reason why this “problem” is only encountered in device-to-device transfers involving references to device components of derived types?
No, I don’t think there’s a deep reason other than that we’re missing this case. The compiler is basically just doing a pattern match and then substituting calls to cudaMemcpy for the matched pattern.
Though even if we do handle this pattern, multi-level derived types, i.e. “T%A%B_d”, will probably still require you to use the direct call. One level isn’t too difficult for the compiler to figure out, but adding more levels increases the complexity.
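For illustration, such a multi-level case would fall back to the direct call like this (a sketch; the types inner/outer and the variable t are hypothetical):

```fortran
! Hypothetical nested types illustrating the "T%A%B_d" case
type inner
   integer, device, allocatable :: b_d(:)
end type inner
type outer
   type(inner) :: a
end type outer

type(outer) :: t
integer :: istat
! ... allocate(t%a%b_d(n)) ...
! Direct call instead of the assignment "t%a%b_d = arr_d",
! which the pattern matcher would not recognize:
istat = cudaMemcpy(t%a%b_d, arr_d, n)
```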
how does the subroutine call in the example (apparently) circumvent this issue?
Because the compiler recognizes the pattern “A_d = B_d” but is simply missing the “T%A_d = B_d” case.
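In other words, argument association strips away the derived-type reference. The sketch below contrasts the two forms, using the identifiers from the minimal example above:

```fortran
d_hst%arr = arr_d                         ! "T%A_d = B_d": pattern not recognized,
                                          ! so the compiler rejects the assignment
call copy_device_array(d_hst%arr, arr_d)  ! inside the subroutine the dummy
                                          ! arguments tgt and sr are plain device
                                          ! arrays, so "tgt = sr" matches the
                                          ! recognized "A_d = B_d" pattern
```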
Many thanks for your help so far. It’s good to know that it’s simply an internal pattern-matching issue. However, I have one last follow-up question, if you don’t mind:
How come this is apparently not an issue with host-to-device transfers involving those patterns? Those worked fine without any need for workarounds in the minimal example, after all.
How come this is apparently not an issue with host-to-device transfers involving those patterns?
They’re different patterns. Each pattern is a combination of the array attribute (host, device, managed, or pinned) of both the left- and right-hand sides, plus whether or not the array is a derived-type member. Think of it as a matrix where each combination is a separate pattern that the compiler must recognize and then translate to the appropriate cudaMemcpy call.
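A few rows of that matrix, written with the variables from the minimal example (a sketch; arr_d2 is a hypothetical second plain device array):

```fortran
arr_d     = arr        ! host -> device, plain array            (recognized)
d_hst%arr = arr        ! host -> device, derived-type member    (recognized)
arr       = d_hst%arr  ! device -> host, derived-type member    (recognized)
arr_d2    = arr_d      ! device -> device, plain arrays         (recognized)
d_hst%arr = arr_d      ! device -> device, derived-type member  (missed)
```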
I see. So currently, the compiler can recognise only some of the possible permutations. That’s definitely good to know. Thanks again for your patience.