allocatable arrays inside device data structures

Hello,
I’m trying to use allocatable arrays inside user-defined types, with the whole data structure residing on the GPU.

Here’s my module declaration:

!=============
! This module contains definitions for data structures and the data
! stored on the device
!=============

   module GPU_variables
   use cudafor

   type :: data_str_def

!=============
! single number quantities
!=============

      integer                       :: i, j 
      real(kind=8)                  :: a 

!=============
! Arrays
!=============

      real(kind=8),   allocatable   :: b(:)
      real(kind=8),   allocatable   :: c(:,:)
      real(kind=8),   allocatable   :: d(:,:,:)
      real(kind=8),   allocatable   :: e(:,:,:,:)

   end type data_str_def

!=============
! Actual data is here
!=============

   type(data_str_def), device, allocatable   :: data_str(:)

   contains

!=============
! subroutine to allocate memory
!=============

      subroutine allocate_mem(n1)
      implicit none 
      integer, intent(in)  :: n1 

      call deallocate_mem()

      write(*,*) 'works here'
      allocate(data_str(n1))

      write(*,*) 'what about allocating memory?'
      allocate(data_str(n1) % b(10))
      write(*,*) 'success!'

      return
      end subroutine allocate_mem

!=============
! subroutine to deallocate memory
!=============

      subroutine deallocate_mem()
      implicit none
      if(allocated(data_str)) deallocate(data_str)
      return 
      end subroutine deallocate_mem

   end module GPU_variables

Calling program is

!=============
! main program 
!=============

    program gpu_test
    use gpu_variables
    implicit none

!=============
! local variables
!=============

    integer             :: i, j, n

!=============
! allocate data
!=============

    n       = 2                 ! number of data structures

    call allocate_mem(n)

!=============
! dallocate device data structures and exit
!=============

    call deallocate_mem()
    end program

module file is called gpu_modules.F90
mainprogram file is called gpu_test.F90

compilation command is

pgfortran -Mcuda=cc5x *.F90

Terminal output is

$ ./a.out
works here
what about allocating memory?
Segmentation fault (core dumped)

The idea was to use GPU memory in modules so that subroutines have access to the data, and data structures are a nice way to organize variable-sized arrays.

Am I doing something obviously wrong? Please help!

Hi Ananth_Srid,

Unfortunately this isn’t going to work this way since you would need to access the device array in order to allocate the type’s arrays. This can’t be done from the host. You could try writing a kernel that allocates the arrays on the device, but the easiest thing to do is use CUDA Unified Memory (i.e. the “managed” attribute) so the same addresses can be accessed from either the host or device.

For example:

 % cat gpu_modules.cuf
!=============
 ! This module contains definitions for data structures and the data
 ! stored on the device
 !=============

    module GPU_variables
    use cudafor

    type :: data_str_def

 !=============
 ! single number quantities
 !=============

       integer                       :: i, j
       real(kind=8)                  :: a

 !=============
 ! Arrays
 !=============

       real(kind=8),   allocatable, managed ::  b(:)
       real(kind=8),   allocatable, managed :: c(:,:)
       real(kind=8),   allocatable, managed :: d(:,:,:)
       real(kind=8),   allocatable, managed :: e(:,:,:,:)

    end type data_str_def

 !=============
 ! Actual data is here
 !=============

    type(data_str_def), managed, allocatable   :: data_str(:)

    contains

 !=============
 ! subroutine to allocate memory
 !=============

       subroutine allocate_mem(n1)
       implicit none
       integer, intent(in)  :: n1

       call deallocate_mem()

       write(*,*) 'works here', n1
       allocate(data_str(n1))

       write(*,*) 'what about allocating memory?'
       allocate(data_str(n1) % b(10))
       write(*,*) 'success!'

       return
       end subroutine allocate_mem

 !=============
 ! subroutine to deallocate memory
 !=============

       subroutine deallocate_mem()
       implicit none
       if(allocated(data_str)) deallocate(data_str)
       return
       end subroutine deallocate_mem

    end module GPU_variables


% pgfortran gpu_modules.cuf gpu_test.cuf -Mcuda=cc60 ; a.out
gpu_modules.cuf:
gpu_test.cuf:
 works here            2
 what about allocating memory?
 success!

Hope this helps,
Mat

hi Mat,
thanks for the suggestion! I’ll try it out and report back here

cheers
Ananth

hi Mat,
I want to have fine control over transfers for performance optimization and benchmarks, so …

I tried to implement your first suggestion - write a kernel to allocate the memory. However, when I try to allocate memory from within the kernel, the compiler throws an error: “unsupported procedure”

PGF90-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unsupported procedure (gpu_modules.F90: 1)

!=====================================================================
! This module contains definitions for data structures and the data
! stored on the device
!=====================================================================

   module gpu_variables
   use cudafor

   type :: data_str_def

      real(kind=8),   allocatable   :: b(:)

   end type data_str_def

!=====================================================================
! Actual data is here
!=====================================================================

   type(data_str_def), device, allocatable   :: data_str(:)

!=====================================================================
! routines follow data
!=====================================================================

   contains

!=====================================================================
! kernel to allocate memory
!=====================================================================

      attributes(global) subroutine allocate_memory(r, bdim)
      implicit none 

      integer, value    :: r, bdim
      integer, device   :: i, j, k

      i     = threadIdx%x + (blockIdx%x - 1)*blockDim%x 
      j     = threadIdx%y + (blockIdx%y - 1)*blockDim%y 
      k     = threadIdx%z + (blockIdx%z - 1)*blockDim%z 
      
      if(i == 1 .and. j == 1 .and. k == 1) then
         allocate(data_str(r) % b(bdim))
      end if 

      end subroutine allocate_memory

!=====================================================================
! kernel to deallocate memory 
!=====================================================================

      subroutine deallocate_memory()
      implicit none 

      if(allocated(data_str)) deallocate(data_str)

      end subroutine deallocate_memory

   end module GPU_variables

and then call it using

!=============
! main program 
!=============

   program gpu_test
   use gpu_variables
   implicit none

!=============
! local variables
!=============

   integer          :: i, j, n
   type(dim3)       :: grid, block   

!=============
! allocate data on cpu first
!=============

    call deallocate_memory()
    n       = 2                 ! number of data structures
    allocate(data_str(n))

    grid    = dim3(1,1,1)
    block   = dim3(1,1,1)
   
    call allocate_memory<<<grid,block>>>(1, 10)

!=============
! dallocate data structures and exit
!=============

    call deallocate_memory()

    end program

Any help you could provide would be great.

Hi Ananth,

I was playing around with this and I don’t see any easy way to get this to work. You’ll need to use “managed” or make “b” a fixed size array (then you don’t need to allocate it on the device).

Sorry,
Mat

hi Mat,
thanks for trying it out. I’ll use a workaround, and keep an eye on new developments.

Do you know if this issue is on PGI’s plans : short/long term?