What is the correct way of working with arrays of derived types with allocatable components?

In the older CPU version of my code, I have some derived types that contain allocatable arrays of data. I have identified some sections of the code where I believe this data can be declared as device-only. What is the correct way to work with this kind of type? For example, a program like this

module test
    implicit none

    type :: array_real_1D_d
        real, dimension(:), device, allocatable :: data
    end type array_real_1D_d

    type :: array_real_1D_h
        real, dimension(:), allocatable :: data
    end type array_real_1D_h

contains

    subroutine sum_face()
        implicit none

        integer, parameter :: a = 2
        integer, parameter :: b = 2

        type(array_real_1D_d), device, allocatable :: weight(:, :)
        type(array_real_1D_d), device, allocatable :: face_d(:)
        type(array_real_1D_h), allocatable :: face_h(:)

        real :: weight_sum
        integer :: k, l, i, j, n

        allocate (weight(a, a))
        do i = 1, a
            do j = 1, a
                allocate (weight(i, j)%data(b))
                do l = 1, b
                    weight(i, j)%data(l) = 1.0
                end do
            end do
        end do
        allocate (face_h(a))
        allocate (face_d(a))
        do i = 1, a
            allocate (face_h(i)%data(b))
            allocate (face_d(i)%data(b))
            do n = 1, b
                face_d(i)%data(n) = 0.0
            end do
        end do

        do concurrent(k=1:a, i=1:b)
            weight_sum = 0.0
            do concurrent(l=1:a) reduce(+:weight_sum)
                weight_sum = weight_sum + weight(k, l)%data(i)
            end do
            face_d(k)%data(i) = face_d(k)%data(i) + weight_sum
        end do

        print *, "Face sum completed."
        print *, "Face array:"
        do i = 1, a
            do j = 1, a
                face_h(i)%data(:) = face_d(i)%data(:)
                print *, face_h(i)%data
            end do
        end do

        deallocate (weight)
        deallocate (face_d)
        deallocate (face_h)

    end subroutine sum_face

end module test

program main
    use test
    implicit none

    call sum_face()

end program main

will generate the following compiler errors when compiled with

 nvfortran -cuda -stdpar=gpu -O0 -g test_derived.f90 -o test_derived
NVFORTRAN-S-0519-More than one reference to a device-resident object in assignment (test_derived.f90: 32)
NVFORTRAN-S-0519-More than one reference to a device-resident object in assignment (test_derived.f90: 42)
NVFORTRAN-S-0519-More than one reference to a device-resident object in assignment (test_derived.f90: 58)
  0 inform,   0 warnings,   3 severes, 0 fatal for sum_face

But is defining separate device and host derived types even a good approach for this kind of problem?

Furthermore, I would like to use !@cuf sentinels to avoid polluting the code with #ifdef guards, but replacing the device declarations with

...
    type :: array_real_1D_d
        real, dimension(:), allocatable :: data
        !@cuf attribute(device) :: data
    end type array_real_1D_d
...
        type(array_real_1D_d), allocatable :: weight(:, :)
        !@cuf attribute(device) :: weight
        type(array_real_1D_d), allocatable :: face_d(:)
        !@cuf attribute(device) :: face_d
...

leads to the following errors:

NVFORTRAN-S-0310-Illegal statement in the specification part of a MODULE  (test_derived.f90: 6)
  0 inform,   0 warnings,   1 severes, 0 fatal for test
NVFORTRAN-S-0034-Syntax error at or near :: (test_derived.f90: 22)
NVFORTRAN-S-0034-Syntax error at or near :: (test_derived.f90: 24)
  0 inform,   0 warnings,   2 severes, 0 fatal for sum_face

In general, how would I go about working with this kind of data? I am not sure flattening the arrays would be an option due to the large scale of the problem.

Change “device” to “managed” for the top-level type arrays:

        type(array_real_1D_d), device, allocatable :: weight(:, :)
        type(array_real_1D_d), device, allocatable :: face_d(:)

to

        type(array_real_1D_d), managed, allocatable :: weight(:, :)
        type(array_real_1D_d), managed, allocatable :: face_d(:)

The problem is that device arrays can’t be accessed on the host, so when you have:

allocate (weight(i, j)%data(b))

this causes a segfault, since a device address is being used for “weight(i, j)” on the host. Changing to “managed” allows the array to be accessed on both the host and the device.

NVFORTRAN-S-0519-More than one reference to a device-resident object in assignment (test_derived.f90: 32)

This error occurs because you have a device-to-device assignment on the host. Only host-to-device or device-to-host assignments are allowed there.
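For reference, a minimal sketch of the allocation and initialization loops once the top-level array is “managed” (a sketch of the fix described above, not a verified program; it reuses the names from the original example):

type(array_real_1D_d), managed, allocatable :: weight(:, :)

allocate (weight(a, a))
do i = 1, a
    do j = 1, a
        ! weight(i, j) lives in managed memory, so dereferencing it on
        ! the host is legal; only the "data" component is device-resident.
        allocate (weight(i, j)%data(b))
        ! Single device reference on the left-hand side: a plain
        ! host-to-device assignment, so NVFORTRAN-S-0519 no longer fires.
        weight(i, j)%data = 1.0
    end do
end do

Assigning the whole data array at once also avoids issuing one host-to-device transfer per element.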

Thank you! That was very clear. Apparently my original problem, and the reason I was exploring the device attribute in the first place, is a bit more complex. The following is not really a minimal reproducible example, but it preserves the core logic. My project is structured like this:

➜  arborescence-compil git:(stdpar) ✗ tree test_cmake
test_cmake
├── app
│   ├── CMakeLists.txt
│   └── main.f90
├── CMakeLists.txt
└── src
    ├── CMakeLists.txt
    └── lib.f90

3 directories, 5 files

and I have

# CMakeLists.txt
cmake_minimum_required(VERSION 3.24...4.0)

project(hawen VERSION 1.4.1 LANGUAGES Fortran CUDA)

set(CMAKE_Fortran_MODULE_DIRECTORY ${PROJECT_BINARY_DIR}/include)

add_subdirectory(src)
add_subdirectory(app)
# src/CMakeLists.txt
add_library(hawen_lib STATIC lib.f90)
target_compile_options(hawen_lib PRIVATE -cuda -stdpar=gpu -Minfo=all)
target_link_options(hawen_lib PUBLIC -static-nvidia -cuda)
# app/CMakeLists.txt
add_executable(main main.f90)
target_link_libraries(main PRIVATE hawen_lib)
! src/lib.f90
module lib_mod
    use cudafor
    implicit none
    private
    public :: dg, array_2D_d, sum_face

    type :: array_2D_d
        real, device, allocatable :: data(:, :)
    end type array_2D_d

    type :: array_3D_d
        real, device, allocatable :: data(:, :, :)
    end type array_3D_d

    type dg
        type(array_3D_d), managed, allocatable :: weight(:)
    end type dg

contains

    subroutine sum_face(ctx_dg, face)
        implicit none

        type(dg), target, intent(in) :: ctx_dg
        type(array_2D_d), intent(inout) :: face(:)

        integer :: i, j, p
        real :: rhosum
        real, allocatable :: rho(:)
        type(array_3D_d), pointer :: weight(:)

        rho = [1.0, 1.0, 1.0, 1.0]

        weight => ctx_dg%weight

        do concurrent(j=1:3, i=1:3)
            rhosum = 0.0

            do concurrent(p=1:4) reduce(+:rhosum)
                rhosum = rhosum + weight(1)%data(i, j, p)*rho(p)
            end do

            face(1)%data(i, j) = rhosum
        end do

        deallocate (rho)
    end subroutine

end module lib_mod
! app/main.f90
module main_mod
    use lib_mod, only: dg, array_2D_d, sum_face
    implicit none
    private

    public :: first

contains

    subroutine init_dg(ctx_dg)
        type(dg), intent(out) :: ctx_dg

        allocate (ctx_dg%weight(1))
        allocate (ctx_dg%weight(1)%data(3, 3, 4))
        ctx_dg%weight(1)%data &
            = reshape([0.25, 0.1, 0.1, 0.1, 0.25, 0.1, 0.1, 0.1, 0.25, &
                       0.25, 0.1, 0.1, 0.1, 0.25, 0.1, 0.1, 0.1, 0.25, &
                       0.25, 0.1, 0.1, 0.1, 0.25, 0.1, 0.1, 0.1, 0.25, &
                       0.25, 0.1, 0.1, 0.1, 0.25, 0.1, 0.1, 0.1, 0.25], [3, 3, 4])
    end subroutine init_dg

    subroutine first()
        implicit none

        type(array_2D_d), allocatable :: vol_phi_phi_rho_d(:)
        type(dg) :: ctx_dg

        allocate (vol_phi_phi_rho_d(1))
        allocate (vol_phi_phi_rho_d(1)%data(3, 3))

        call init_dg(ctx_dg)

        call sum_face(ctx_dg, vol_phi_phi_rho_d)
    end subroutine first
end module main_mod

program main
    use main_mod
    implicit none

    call first()
end program main

Running with

cmake -B build -DCMAKE_BUILD_TYPE=Debug && cmake --build build --config Debug && build/app/main

will cause this error

Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
 File: /home/eduard/Github/hawen_worktree/arborescence-compil/test_cmake/src/lib.f90
 Function: sum_face:22
 Line: 37

This error does not appear when first and init_dg are defined inside lib.f90, so there is probably something weird happening with the linking. Running under cuda-gdb gives the following:

(cuda-gdb) run
Starting program: /home/eduard/Github/hawen_worktree/arborescence-compil/test_cmake/build/app/main 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff35ff000 (LWP 1054119)]
[New Thread 0x7ffff21ff000 (LWP 1054120)]
[Detaching after fork from child process 1054121]
[New Thread 0x7fffebfff000 (LWP 1054135)]
[New Thread 0x7fffeb7fe000 (LWP 1054136)]

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7fffd7271f80  lib_mod_sum_face_37_gpu  (lib.f90:41)

Thread 1 "main" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 1, lane 0]
0x00007fffd7271f90 in lib_mod_sum_face_37_gpu<<<(1,1,1),(32,1,1)>>> (face=0x7fffcd2fa200, rho=0x7fffca000000, weight=0x7fffcd2fa000)
    at /home/eduard/Github/hawen_worktree/arborescence-compil/test_cmake/src/lib.f90:41
41                      rhosum = rhosum + weight(1)%data(i, j, p)*rho(p)
(cuda-gdb) info locals
blockdim = ( x = 32, y = 1, z = 1 )
blockidx = ( x = 0, y = 0, z = 0 )
threadidx = ( x = 0, y = 0, z = 0 )
(cuda-gdb) info args
face = 0x7fffcd2fa200
rho = 0x7fffca000000
weight = 0x7fffcd2fa000
(cuda-gdb) p weight(1)%data


Fatal signal: Floating point exception
----- Backtrace -----
0x54234b ???
0x6ee661 ???
0x77b88c64580f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x9aebc3 ???
0x99da40 ???
0x99dd1d ???
0x9a19d9 ???
0x6fe67a ???
0x6e899c ???
0x8296b1 ???
0x829db5 ???
0x56ed6f ???
0x95a4be ???
0x6eee2e ???
0x6eff8d ???
0x6ef67c ???
0x9d8097 ???
0x6ee98d ???
0x6ef56d ???
0x98e01f ???
0xb3d77c ???
0xb3d902 ???
0x7d1e56 ???
0x7d38a4 ???
0x44ae64 ???
0x77b88c62a337 __libc_start_call_main
        ../sysdeps/nptl/libc_start_call_main.h:58
0x77b88c62a3fa __libc_start_main_impl
        ../csu/libc-start.c:360
0x45e7fd ???
0xffffffffffffffff ???
---------------------
A fatal error internal to GDB has been detected, further
debugging is not possible.  GDB will now terminate.

This is a bug, please report it.  For instructions, see:
<https://forums.developer.nvidia.com/c/developer-tools/cuda-developer-tools/cuda-gdb>.

[1]    1054063 floating point exception (core dumped)  cuda-gdb build/app/main

Looks like you’re missing the GPU flags when building main. Add to “app/CMakeLists.txt”:

target_compile_options(main PRIVATE -cuda)

Works for me after I make this change.


Thank you, so I have to use the flag even if I don’t explicitly use CUDA Fortran in an external module. I tried this solution originally, but I was getting CUDA errors about cudaFree_v2 being called with an illegal address, even before any code was actually executed. I cannot reproduce that issue now, and I haven’t changed anything in the code since then, so I will attribute it to something weird with the drivers and move on. Thank you again for the help!

FYI, you actually do use CUDA Fortran in main; it’s just not obvious. “data” is defined as a device array, but without -cuda it gets allocated using the host allocator. In this example, the allocations done in main.f90 (in init_dg and first) therefore store host addresses, which the kernel in sum_face then dereferences, producing the illegal-address error.
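For illustration, a hypothetical two-file sketch of that situation (the module and file names are made up, mirroring the lib/app split above):

! dev_mod.f90 -- compiled with CUDA Fortran enabled:
!     nvfortran -cuda -c dev_mod.f90
module dev_mod
    implicit none
    type :: holder
        real, device, allocatable :: data(:)   ! device-resident component
    end type holder
end module dev_mod

! use_dev.f90 -- compiled WITHOUT -cuda, as app/main.f90 was above:
!     nvfortran -c use_dev.f90
program use_dev
    use dev_mod
    implicit none
    type(holder) :: h
    ! Without -cuda this allocation goes through the host allocator, so
    ! h%data ends up holding a host address despite the device attribute
    ! in the module; any kernel that later dereferences it hits an
    ! illegal address, as in the error 700 above.
    allocate (h%data(4))
end program use_dev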

That makes sense. It’s still strange that this does not lead to a compile error, as it would if the annotation were in a module compiled without the -cuda flag.