I’ve implemented the compute-intensive portion of my program in OpenACC, with a partial speed-up. To utilize both my GPU and my CPUs, I want to do something like this:
! Initialize everything, on both host and GPU
...
do i_t = 1,n_t
!$omp parallel sections
!$omp section
! update arrays on the cpu for this timestep
call UpdateIntegrationStep()
!$omp section
! update gpu-driven simulation for this timestep
call Solve_GPU()
!$omp section
! possibly do some disk i/o on the host for the timestep
if (mod(i_t,10)==0) call SaveStep()
!$omp end parallel sections
! synchronize certain subarrays between host and gpu
!$acc update host(....)
enddo
Without these OMP directives, the code compiles and runs just fine. When I add these directives, I get this error from the thread trying to execute the GPU work:
call to cuModuleLoadData returned error 201: Invalid context
I have tried messing about with “acc_set_device_num”, but with no luck. Any thoughts on whether what I’m trying is possible? Why it isn’t working?
Platform is:
- PGI (Visual) Fortran 14.1 (yes, the new one)
- Win 7 64
- Dual Xeon Sandy Bridge (12 cores, 24 with HT)
- NVIDIA GeForce GTX 650 Ti
Hi Andrew,
Working with OpenMP and OpenACC together can be a bit tricky. Each OpenMP thread will create its own context, either implicitly when it encounters an OpenACC region (data or compute) or explicitly by calling “acc_init”. More importantly, you’ll need to manually decompose the problem, ensuring the correct data gets to the correct GPU context. (Note that a host thread can create multiple contexts, but a single context can’t be shared by multiple host threads.)
Given the code snippet and error, it appears that you’re initializing the GPU outside a parallel region, hence the master thread is creating the GPU context. Then, when entering the parallel sections, a different thread executes the Solve_GPU routine and tries to access data from the master thread’s context.
Given that it’s nondeterministic as to which thread will execute a given section, you’ll need to encapsulate all of your GPU usage within the “Solve_GPU” routine (or that section). This way it wouldn’t matter which thread executed it. The drawback would be that you would need to copy the data back and forth after each call.
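As a rough illustration of that suggestion, here is a minimal sketch of a self-contained Solve_GPU. The array name u, its size n, and the loop body are placeholders that are not from the original program; the only point is that the data region opens and closes inside the routine, so the GPU context is created and used by the same thread.

```fortran
! Hypothetical sketch: the arguments `u` and `n` and the update in the
! loop are placeholders, not taken from the original program.
subroutine Solve_GPU(u, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: u(n)
  integer :: i

  ! The data region lives entirely inside this routine, so whichever
  ! OpenMP thread runs the section creates and uses its own GPU context.
  ! The cost is a host-to-device and device-to-host copy on every call.
  !$acc data copy(u)
  !$acc parallel loop
  do i = 1, n
     u(i) = 2.0 * u(i)   ! placeholder for the real per-timestep update
  end do
  !$acc end data
end subroutine Solve_GPU
```

With this layout, nothing outside the routine needs a device copy of u, so data directives placed outside the parallel sections would no longer be needed for it.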
Hope this helps,
Mat
It helps, Mat. Thanks. I think I can see how to accomplish what I wanted with OpenMP/OpenACC.
It would be something like, every time GPU code is being executed:
!$omp parallel
if (omp_get_thread_num()==0) then
! do GPU stuff
else
! any CPU stuff to be executed in parallel here
endif
!$omp end parallel
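A slightly fuller sketch of that pattern, assuming the per-thread context behaviour Mat describes: the master thread opens the data region outside the parallel region, and inside it only thread 0 launches GPU work while the remaining threads split a CPU loop. The program name, the arrays u_gpu and u_cpu, and the loop bodies are made up for illustration.

```fortran
program hybrid_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 1000
  real :: u_gpu(n), u_cpu(n)      ! hypothetical arrays for illustration
  integer :: i, tid, nworkers

  u_gpu = 1.0
  u_cpu = 1.0

  ! The data region is opened by the master thread, outside the parallel
  ! region, so the device copy belongs to the master thread's context.
  !$acc data copy(u_gpu)
  !$omp parallel private(i, tid, nworkers)
  tid = omp_get_thread_num()
  nworkers = omp_get_num_threads() - 1

  if (tid == 0) then
     ! Only thread 0 (the master) touches the GPU.
     !$acc parallel loop present(u_gpu)
     do i = 1, n
        u_gpu(i) = u_gpu(i) + 1.0
     end do
  end if

  if (tid > 0 .or. nworkers == 0) then
     ! Worker threads share the CPU loop by striding. If the program runs
     ! with a single thread, thread 0 does this after the GPU work.
     do i = max(tid, 1), n, max(nworkers, 1)
        u_cpu(i) = u_cpu(i) + 1.0
     end do
  end if
  !$omp end parallel
  !$acc end data

  print *, 'gpu:', u_gpu(1), ' cpu:', u_cpu(1)
end program hybrid_sketch
```

The CPU loop is split by hand because an "!$omp do" worksharing construct must be encountered by every thread in the team, which the if/else on the thread number prevents.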
At the moment, I’ve offloaded the whole simulation to the GPU, but being able to utilize both GPU and CPU would be…dandy.
Alas, I imagine that it’s time to learn MPI.