Invalid context error with OMP & GPU

I have the following code, which repeats a large number of calculations for every element in vector y and returns the results in vector z. The main program makes numerous calls to this subroutine, which compiles without error and appears to execute without a problem.

subroutine SingleGPU(nd, nx, ny, x, y, z)
use accel_lib
integer :: nd, nx, ny, i, j
real :: x(nd,nx), y(nd,ny), z(ny), v(nx), p(nx)
!$acc region do private(j, v, p)
do i = 1, ny
	p = 1.0
	do j = 1, nd
		v = y(j,i) - x(j,1:nx)
		p = p * ( .9375 * (1.0 - v**2)**2 * (abs(v) < 1.0) )
	end do
	z(i) = sum(p)
end do
!$acc end region
return
end subroutine SingleGPU

However, I have three C2050s and would like to use all of them. To spread the workload among multiple accelerators, I modified the code as follows.

subroutine MultiGPU(nd, nx, ny, x, y, z)
use accel_lib
use omp_lib
integer :: nd, nx, ny, i, ilo, ihi, j, ndevices
real :: x(nd,nx), y(nd,ny), z(ny), v(nx), p(nx)
ndevices = acc_get_num_devices(acc_device_nvidia)
!$omp parallel private(i, ilo, ihi, j, v, p, y, x) num_threads(ndevices)
call acc_set_device_num(omp_get_thread_num(), acc_device_nvidia)
ilo = omp_get_thread_num() * (ny/ndevices + 1) + 1
ihi = min(ny, ilo + (ny/ndevices + 1) - 1)
!$acc region do private(j, v, p)
do i = ilo, ihi
	p = 1.0
	do j = 1, nd
		v = y(j,i) - x(j,1:nx)
		p = p * ( .9375 * (1.0 - v**2)**2 * (abs(v) < 1.0) )
	end do
	z(i) = sum(p)
end do
!$acc end region
!$omp end parallel
return
end subroutine MultiGPU
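
The ilo/ihi arithmetic can be checked on the host alone, with no accelerator or OpenMP involved. The following is a minimal sketch: the sizes ny = 10 and ndevices = 3 are assumptions chosen for illustration, and tid is a stand-in for omp_get_thread_num().

```fortran
program check_partition
  implicit none
  ! Example sizes are assumptions chosen for illustration only.
  integer, parameter :: ny = 10, ndevices = 3
  integer :: tid, ilo, ihi, chunk, covered

  chunk = ny/ndevices + 1          ! same chunk size as in MultiGPU
  covered = 0
  do tid = 0, ndevices - 1         ! tid stands in for omp_get_thread_num()
     ilo = tid*chunk + 1
     ihi = min(ny, ilo + chunk - 1)
     print '(a,i0,a,i0,a,i0)', 'thread ', tid, ': i = ', ilo, ' .. ', ihi
     ! An empty range (ihi < ilo) contributes nothing.
     covered = covered + max(0, ihi - ilo + 1)
  end do
  ! The ranges are disjoint by construction, so their total length
  ! should equal ny if every index 1..ny is assigned exactly once.
  if (covered /= ny) error stop 'partition does not cover 1..ny'
end program check_partition
```

With these sizes the three ranges are 1..4, 5..8, and 9..10. When ny is small relative to ndevices, the trailing threads can end up with an empty range, in which case their do loop simply executes zero times.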

Within the accelerator region, the only difference between MultiGPU and SingleGPU is the addition of the variables ilo and ihi to divide the workload among the available devices. I’ve checked omp_get_thread_num(), ilo, and ihi prior to entering the accelerator region. All are returning the expected values. This code compiles fine and appears to execute fine the first time it is called, but when called a second time it fails and returns the following message:

call to cuModuleGetFunction returned error 201: Invalid context
CUDA driver version: 3010

I’m at a loss. Can someone please help me understand what’s going on here?

It might be a problem calling acc_set_device_num() more than once. Try putting that in a conditional so it only happens once.
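
A minimal sketch of such a guard follows. BindDeviceOnce and device_bound are hypothetical names, not part of accel_lib. The sketch assumes that each OpenMP thread keeps its threadprivate flag from one parallel region to the next, which generally requires dynamic thread adjustment to be off and the same num_threads in every region.

```fortran
subroutine BindDeviceOnce()
  use accel_lib
  use omp_lib
  implicit none
  ! Hypothetical per-thread flag (not part of accel_lib): each OpenMP
  ! thread gets its own copy, initialized to .false.
  logical, save :: device_bound = .false.
  !$omp threadprivate(device_bound)

  ! Select the device only on this thread's first call; later calls
  ! fall through without touching the device selection.
  if (.not. device_bound) then
     call acc_set_device_num(omp_get_thread_num(), acc_device_nvidia)
     device_bound = .true.
  end if
end subroutine BindDeviceOnce
```

Inside the parallel region of MultiGPU, the direct call to acc_set_device_num() would then become call BindDeviceOnce(). This only helps if every call to MultiGPU uses the same thread-to-device mapping.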

Brent, thanks so much for your help. I assumed that acc_set_device_num() could be called anytime outside an accelerator region, so that processing could be redirected at any point and as often as needed. As you’ve suggested, however, that is not the case. I convinced myself of this by inserting a call to acc_shutdown() just before ending the omp thread. The program then runs without the error, but so slowly that I’d be better off confining all work to a single device. As currently written, the program can’t work as intended if I place the call to acc_set_device_num() in a conditional, as you suggested. Guess I’ll need to revise the flow of work in the main program.

Maybe it’s only me, but this seems to be a real limitation. Can anyone from PGI comment on the chances of improving on this in future revisions?

Hi Kim,

I’ll put in a feature request asking either for a runtime function that checks whether a device context has already been created, or for acc_set_device to be a no-op if the context is already set.

Thanks,
Mat

Thanks, Mat. That would be very helpful.