While combining CUDA Fortran with OpenMP, I ran into distinct problems depending on whether device variables were declared private in the parallel region. If I did not declare a device variable private, each CPU thread correctly assigned itself a device, but a copy error was thrown as soon as the second thread tried to access the device variable. If I did declare these variables private, the CPU threads did not correctly assign themselves devices; the error was number 36, “Active Process Error”. The main difference I noticed from Mat’s post in https://forums.developer.nvidia.com/t/is-it-possible-to-use-both-openmp-cuda-in-pgi-fortran/132144/1 was that his variables were declared allocatable, while mine were declared with fixed sizes.
Making all device variables (including scalars!) allocatable, declaring them private, and then allocating them inside the parallel construct resolved both the copy error and the Active Process Error. Is there an easier way around these issues? It seems that declaring a non-allocatable device variable private in a parallel construct counts as the initial access to a device and locks all CPU threads onto a single card, regardless of whether any data was transferred before calling cudasetdevice().
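For reference, here is a minimal sketch of the pattern that worked, not my actual code. The program name, the names `a_d`, `s_d`, `a_h`, and the size `n` are hypothetical, and it assumes one OpenMP thread per visible GPU, with thread number `tid` mapped to device `tid`. The point is only the ordering: the device variables are allocatable and private, and nothing touches them until after each thread has selected its device.

```fortran
program multi_gpu_sketch
  use cudafor
  use omp_lib
  implicit none

  integer, parameter :: n = 1024            ! hypothetical problem size
  real, device, allocatable :: a_d(:)       ! device array: allocatable, not fixed size
  real, device, allocatable :: s_d          ! device scalar: allocatable as well
  real    :: a_h(n)                         ! host staging buffer
  integer :: istat, ndev, tid

  istat = cudaGetDeviceCount(ndev)
  if (istat /= cudaSuccess .or. ndev < 1) stop 'no CUDA device found'

  ! One CPU thread per GPU (assumption). Each thread gets its own
  ! unallocated private copy of the device variables.
  !$omp parallel num_threads(ndev) private(a_d, s_d, a_h, istat, tid)
  tid = omp_get_thread_num()

  ! Select the device first, before any device variable is touched.
  istat = cudaSetDevice(tid)
  if (istat /= cudaSuccess) print *, 'thread', tid, ': ', trim(cudaGetErrorString(istat))

  ! Allocate only now, so the memory lands on this thread's device.
  allocate(a_d(n), s_d)

  a_h = real(tid)
  a_d = a_h            ! host-to-device copy
  s_d = 2.0            ! host-to-device scalar assignment
  a_h = a_d            ! device-to-host copy

  print *, 'thread', tid, 'read back a_h(1) =', a_h(1)

  deallocate(a_d, s_d)
  !$omp end parallel
end program multi_gpu_sketch
```

I would expect this to build with something like `nvfortran -cuda -mp` (or `pgfortran -Mcuda -mp` on older PGI releases), but treat those flags as an assumption for your compiler version.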
If there is a better way to approach this, please let me know. I can’t put everything into a subroutine, because the GPUs need to swap data at each timestep. I also cannot use MPI for this portion because it is a subsection of a much larger code.