Invalid context error with OMP & GPU

Kim_AKeating12934 · October 12, 2010, 7:28pm

I have the following code, which repeats a large number of calculations for every element in vector y and returns the results in vector z. The main program makes numerous calls to this subroutine, which compiles without error and appears to execute without a problem.

subroutine SingleGPU(nd, nx, ny, x, y, z)
use accel_lib
integer :: nd, nx, ny, i, j
real :: x(nd,nx), y(nd,ny), z(ny), v(nx), p(nx)
!$acc region do private(j, v, p)
do i = 1, ny
	p = 1.0
	do j = 1, nd
		v = y(j,i) - x(j,1:nx)
		p = p * ( .9375 * (1.0 - v**2)**2 * (abs(v) < 1.0) )
	end do
	z(i) = sum(p)
end do
!$acc end region
return
end subroutine SingleGPU

However, I have three C2050s and would like to use all of them. To spread the workload among multiple accelerators, I modified the code as follows.

subroutine MultiGPU(nd, nx, ny, x, y, z)
use accel_lib
use omp_lib
integer :: nd, nx, ny, i, ilo, ihi, j, ndevices
real :: x(nd,nx), y(nd,ny), z(ny), v(nx), p(nx)
ndevices = acc_get_num_devices(acc_device_nvidia)
!$omp parallel private(i, ilo, ihi, j, v, p, y, x) num_threads(ndevices)
call acc_set_device_num(omp_get_thread_num(), acc_device_nvidia)
ilo = omp_get_thread_num() * (ny/ndevices + 1) + 1
ihi = min(ny, ilo + (ny/ndevices + 1) - 1)
!$acc region do private(j, v, p)
do i = ilo, ihi
	p = 1.0
	do j = 1, nd
		v = y(j,i) - x(j,1:nx)
		p = p * ( .9375 * (1.0 - v**2)**2 * (abs(v) < 1.0) )
	end do
	z(i) = sum(p)
end do
!$acc end region
!$omp end parallel
return
end subroutine MultiGPU

Within the accelerator region, the only difference between this and the first version of the code is the addition of the variables ilo and ihi to divide the workload among the available devices. I’ve checked omp_get_thread_num(), ilo, and ihi prior to entering the accelerator region. All are returning the expected values. This code compiles fine and appears to execute fine the first time it is called, but when called a second time it fails and returns the following message:

call to cuModuleGetFunction returned error 201: Invalid context
CUDA driver version: 3010

I’m at a loss. Can someone please help me understand what’s going on here?
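For reference, the ilo/ihi arithmetic in MultiGPU can be checked on its own, without OpenMP or a GPU. The sketch below is only an illustration: the program name check_partition and the sample sizes ny = 10 and ndevices = 3 are assumptions, and the loop variable t stands in for omp_get_thread_num().

```fortran
program check_partition
implicit none
! Assumed sample sizes, chosen only for illustration
integer, parameter :: ny = 10, ndevices = 3
integer :: t, ilo, ihi, chunk, covered

chunk = ny/ndevices + 1          ! same block size as used in MultiGPU
covered = 0
do t = 0, ndevices - 1           ! t stands in for omp_get_thread_num()
    ilo = t*chunk + 1
    ihi = min(ny, ilo + chunk - 1)
    print '(a,i0,a,i0,a,i0)', 'thread ', t, ': i = ', ilo, ' .. ', ihi
    ! If ihi < ilo the range is empty and that thread's loop runs zero times
    covered = covered + max(0, ihi - ilo + 1)
end do
! The blocks are contiguous by construction, so a total of ny means
! every index in 1..ny is assigned to exactly one thread
if (covered /= ny) stop 'ranges do not add up to ny'
end program check_partition
```

With these sample values the three ranges are 1..4, 5..8, and 9..10. When ny is small relative to ndevices, the last thread can receive an empty range (for example ny = 3 gives thread 2 the range 5..3), which is harmless because the do loop then executes no iterations.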

brentl · October 13, 2010, 9:06pm

It might be a problem calling acc_set_device_num() more than once. Try putting that in a conditional so it only happens once.
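A minimal sketch of such a guard, assuming the PGI compiler with accel_lib available. BindDeviceOnce is a hypothetical helper, not part of accel_lib. It also assumes the OpenMP runtime reuses the same threads, with the same thread numbers, from one parallel region to the next, so that each thread's flag persists between calls.

```fortran
subroutine BindDeviceOnce()
use accel_lib
use omp_lib
implicit none
! Hypothetical helper: each OpenMP thread keeps its own copy of this flag
logical, save :: bound = .false.
!$omp threadprivate(bound)

! Select the device only the first time this thread gets here
if (.not. bound) then
    call acc_set_device_num(omp_get_thread_num(), acc_device_nvidia)
    bound = .true.
end if
end subroutine BindDeviceOnce
```

In MultiGPU, the direct call to acc_set_device_num() inside the parallel region would become call BindDeviceOnce(). Threadprivate values are only guaranteed to persist between parallel regions when the number of threads stays the same and dynamic thread adjustment is off, so this is worth verifying before relying on it.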

Kim_AKeating12934 · October 14, 2010, 3:36pm

Brent, thanks so much for your help. I assumed that acc_set_device_num() could be called at any time outside an accelerator region, so that processing could be redirected at any point and as often as needed. As you’ve suggested, however, that is not the case. I convinced myself of this by inserting a call to acc_shutdown() just before the end of the OpenMP parallel region. The program then runs without the error, but so slowly that I’d be better off confining all work to a single device. As currently written, the program can’t work as intended if I place the call to acc_set_device_num() in a conditional, as you suggested. I guess I’ll need to revise the flow of work in the main program.

Maybe it’s only me, but this seems to be a real limitation. Can anyone from PGI comment on the chances of improving on this in future revisions?

MatColgrove · October 14, 2010, 11:29pm

Hi Kim,

I’ll put in a feature request asking for a runtime function that checks whether a device context has already been created, or possibly for acc_set_device_num() to be a no-op if the context is already set.

Thanks,
Mat

Kim_AKeating12934 · October 15, 2010, 3:31pm

Thanks, Mat. That would be very helpful.