OpenMP + CUDA Fortran issue

I noticed while implementing CUDA Fortran + OpenMP that there are distinct issues with declaring device variables private in the parallel section. If I did not declare a device variable private, each CPU thread would correctly assign itself a device, but a copy error was thrown whenever the second thread tried to access the device variable. When I did declare these variables private, the CPU threads would no longer correctly assign themselves devices; the error was number 36, “Active Process Error”. The main difference I noticed from Mat’s post in https://forums.developer.nvidia.com/t/is-it-possible-to-use-both-openmp-cuda-in-pgi-fortran/132144/1 was that his variables were declared allocatable, while mine were declared with fixed sizes.

Setting all device variables (including scalars!) to allocatable, declaring them private, and then allocating them within the parallel construct resolved both the copy error and the Active Process Error. Is there an easier way around these issues? It seems that declaring a non-allocatable device variable private in a parallel construct counts as initial access to a device and locks all CPU threads onto a single card, regardless of whether any data was transferred before calling cudaSetDevice().
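For reference, here is a minimal sketch of the workaround described above (array sizes and variable names are illustrative):

```fortran
! Sketch: device variables are allocatable and private, and each OpenMP
! thread allocates them only after it has selected its own device.
program multi_gpu
  use cudafor
  use omp_lib
  implicit none
  real, allocatable, device :: a_d(:)   ! fixed-size declaration would trigger
  real, allocatable, device :: s_d      ! context creation too early; even
                                        ! scalars must be allocatable
  integer :: tid, istat

  !$omp parallel private(a_d, s_d, tid, istat)
  tid = omp_get_thread_num()
  istat = cudaSetDevice(tid)   ! per-thread context created here
  allocate(a_d(1024))          ! device allocation now lands on this
  allocate(s_d)                ! thread's own device
  ! ... kernels and transfers on this thread's device ...
  deallocate(a_d)
  deallocate(s_d)
  !$omp end parallel
end program multi_gpu
```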

If there is a better way to approach this, please let me know. I can’t put everything into a subroutine, because the GPUs need to swap data at each timestep. I also cannot use MPI for this portion because it is a subsection of a much larger code.

Brian

Hi Brian,

A device context is created upon first use of a device. So if you have a static device variable declared at the start of your routine, the context is created upon entry since space needs to be allocated on the device to hold these variables. As you discovered, this doesn’t work if you later enter an OpenMP region since multiple threads can’t share a context.

What you need to do is delay the creation of the device variables until after each OpenMP thread has created its context. As you note, this can be done by making private copies of the variables and having each thread allocate them after the context is created. A second option is to move your device code into a subroutine that gets called by each OpenMP thread.
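A rough sketch of the subroutine approach (names hypothetical): because the device variables are local to the subroutine, they are not created until the subroutine is entered, by which point each thread's context already exists.

```fortran
! Sketch: each OpenMP thread calls cudaSetDevice first, then this routine.
! The automatic device array is created on entry, inside the calling
! thread's own context, and freed on exit.
subroutine do_step(a_h, n)
  use cudafor
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: a_h(n)   ! shared host data passed in
  real, device :: a_d(n)             ! created on entry, per-thread context

  a_d = a_h                          ! host -> device
  ! ... launch kernels on this thread's device ...
  a_h = a_d                          ! device -> host, so the host code
end subroutine do_step               ! can swap data between GPUs
```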

I can’t put everything into a subroutine, because the GPUs need to swap data at each timestep

Can you pass in your shared host variables? Granted, I don’t know your code, but it seems like you should still be able to swap host data around even if it’s done from a subroutine.

  • Mat