I am trying to understand why, in CUDA Fortran, a derived type containing only device allocatable arrays needs the managed attribute rather than allowing the device attribute. A copy of sample test code follows; the derived type in question is extracted here:
type :: cs1
   real, device, allocatable, dimension(:,:) :: s
   real, device, allocatable, dimension(:)   :: cfl
end type cs1

type(cs1), managed, allocatable, dimension(:) :: edge_stream  ! This works ok
!!type(cs1), device, allocatable, dimension(:) :: edge_stream ! This causes a core dump
Thank you.
cuda_stream_module_f90.txt (5.1 KB)
This part:
allocate(edge_stream(n)%s(m1,m2))
allocate(edge_stream(n)%cfl(m2))
If “edge_stream” is a device array, it cannot be accessed from the host. Hence when you try to allocate the data members, the host segfaults, because “edge_stream” itself must be dereferenced to reach them.
“Managed” memory is accessible from both the host and the device, which is why it works.
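A minimal sketch of the working pattern (assuming nvfortran with the cudafor module; the module name, sizes, and loop here are illustrative, not taken from the attached file):

```fortran
module cs_mod
   use cudafor
   implicit none
   type :: cs1
      real, device, allocatable, dimension(:,:) :: s
      real, device, allocatable, dimension(:)   :: cfl
   end type cs1
end module cs_mod

program alloc_demo
   use cs_mod
   implicit none
   type(cs1), managed, allocatable, dimension(:) :: edge_stream
   integer :: n

   allocate(edge_stream(4))   ! parent array lives in managed memory
   do n = 1, 4
      ! Legal: dereferencing edge_stream(n) on the host works because the
      ! parent is managed; the members themselves are device allocations.
      allocate(edge_stream(n)%s(16,16))
      allocate(edge_stream(n)%cfl(16))
   end do
end program alloc_demo
```

With the commented-out device declaration of edge_stream instead, the same allocate statements would dereference a device address on the host and segfault.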
Is there a way I can allocate the array elements within the device-attribute derived type without resorting to managed memory?
I tried writing an attributes(global) alloc_gpu() kernel with just one thread to allocate the array elements, but the compiler would not permit the procedure.
I noticed (using the example above) that if I give edge_stream the “managed” attribute but the allocatable array elements (s, cfl) the “device” attribute, it compiles and runs successfully.
Moreover, I also seem to be able to use a local pointer (e_s) with device routines that points to the managed array edge_stream. Am I doing this memory management efficiently?
Correct, only the parent type needs to be managed. The allocatable array members can have the device attribute.
Again, the problem is that device data allocation can only be initiated from the host. Hence if the parent object had the “device” attribute, then when you go to allocate the members, the allocation segfaults because the parent’s address is dereferenced on the host.
Moreover, I also seem to be able to use a local pointer (e_s) with device routines that points to the managed array edge_stream.
The pointer assignment will be to the managed memory address, but that can be used as a device address, so it should be fine.
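A hedged sketch of what that pointer pattern might look like (all names here are illustrative, not from the attached file; note that for the pointer assignment to be valid, edge_stream would also need the target attribute):

```fortran
! Host side: a local pointer aliasing the managed parent array
type(cs1), managed, allocatable, target, dimension(:) :: edge_stream
type(cs1), pointer, dimension(:) :: e_s

e_s => edge_stream                 ! e_s now holds a managed address
call touch_cfl<<<1, 16>>>(e_s, 1)  ! managed address is usable as a device address

! Device side: a hypothetical kernel that writes through the pointer target
attributes(global) subroutine touch_cfl(e_s, n)
   use cs_mod
   type(cs1), device :: e_s(*)
   integer, value :: n
   e_s(n)%cfl(threadIdx%x) = 0.0
end subroutine touch_cfl
```

The pointer itself lives on the host; only the address it carries, which points into managed memory, is handed to the kernel.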
Am I doing this memory management efficiently?
I don’t have enough information to say, but managed memory is more about ease of use. As with all data management, the most efficient approach is to copy the data to the device at the beginning, perform all computation on that data on the device, and then bring the results back at the end. If you need the data back on the host during the run, that’s fine, but it adds data-movement cost.
The key difference with managed memory is that the CUDA driver does the data movement implicitly, and only when the data is “dirty”. So if you don’t touch the data on the host, it doesn’t get copied back.
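For contrast, the explicit-movement pattern described above looks roughly like this (a sketch; the array names and the kernel are placeholders):

```fortran
real, allocatable         :: a(:)    ! host copy
real, device, allocatable :: a_d(:)  ! device copy

allocate(a(n), a_d(n))
a   = 1.0
a_d = a                                   ! host -> device, once at the start
call compute<<<blocks, threads>>>(a_d, n) ! all work stays on the device
a   = a_d                                 ! device -> host, once at the end
```

Here the programmer pays the two transfers explicitly and exactly once, whereas with managed memory the driver migrates pages on demand whenever host or device touches data the other side last modified.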