cudaSetDevice failing


I’m trying to use multiple GPUs on a Linux machine that has 2 GTX Titan Z cards (4 physical GPU chips).

I use 4 OpenMP threads so that I can associate each CPU thread with one GPU chip.

But I get some errors that I cannot make sense of:

Failing in Thread:4
Failing in Thread:2
Failing in Thread:3
call to cuDevicePrimaryCtxRetain returned error 709: Context is destroyed or not yet created
call to cuDevicePrimaryCtxRetain returned error 709: Context is destroyed or not yet created
call to cuDevicePrimaryCtxRetain returned error 709: Context is destroyed or not yet created
Failing in Thread:4
call to cuDevicePrimaryCtxRelease returned error 4: Deinitialized
Failing in Thread:1
call to cuModuleLoadData returned error 300: Invalid Source
Error: _mp_pcpu_reset: lost thread

Following is the code section where the error occurs.


!$OMP PARALLEL PRIVATE(tid, ierr, cuProperty, PinCount, PinBeg, PinEnd, maxFsr)
tid = OMP_GET_THREAD_NUM(); ierr = cudaSetDevice(tid)
CALL ACC_SET_DEVICE_NUM(tid, acc_device_nvidia)

ierr = cudaGetDeviceProperties(cuProperty, tid)
cuDevice(tid)%cuSMXCount = cuProperty%multiProcessorCount
cuDevice(tid)%cuArchitecture = cuProperty%major
cuDevice(tid)%cuWarpSize = cuProperty%warpSize
cuDevice(tid)%cuMaxThreadPerSMX = cuProperty%maxThreadsPerMultiprocessor
cuDevice(tid)%cuMaxThreadPerBlock = cuProperty%maxThreadsPerBlock
cuDevice(tid)%cuMaxWarpPerSMX = cuProperty%maxThreadsPerMultiprocessor / cuProperty%warpSize

SELECT CASE (cuDevice(tid)%cuArchitecture)
CASE (2)   !--- Fermi
  cuDevice(tid)%cuMaxBlockPerSMX = 8
CASE (3)   !--- Kepler
  cuDevice(tid)%cuMaxBlockPerSMX = 16
CASE (5)   !--- Maxwell
  cuDevice(tid)%cuMaxBlockPerSMX = 32
CASE (6)   !--- Pascal
  cuDevice(tid)%cuMaxBlockPerSMX = 32
END SELECT

cuDevice(tid)%cuWarpPerBlock = cuDevice(tid)%cuMaxWarpPerSMX / cuDevice(tid)%cuMaxBlockPerSMX
cuDevice(tid)%cuThreadPerBlock = cuDevice(tid)%cuWarpPerBlock * cuDevice(tid)%cuWarpSize

IF (cuDevice(tid)%lFullWarp) THEN
  cuDevice(tid)%sharedMemoryDim = cuDevice(tid)%cuThreadPerBlock
ELSE
  cuDevice(tid)%sharedMemoryDim = 2 * ng
END IF

!$ACC ENTER DATA COPYIN(cuDevice(tid))
!$ACC ENTER DATA COPYIN(cuDevice(tid)%FsrBeg, cuDevice(tid)%FsrEnd)
!$ACC ENTER DATA COPYIN(cuDevice(tid)%PinBeg, cuDevice(tid)%PinEnd)
!$ACC ENTER DATA COPYIN(cuDevice(tid)%DcmpRayList, cuDevice(tid)%DcmpRayCount)
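
Incidentally, the return codes stored in ierr are never checked, so a failure in cudaSetDevice itself would go unnoticed until a later call blows up. A minimal sketch of checking the status right after device selection (assuming the cudafor module is used, as in the snippet above):

```fortran
! Sketch: check the CUDA Fortran return code immediately after device selection.
! tid and ierr are the same variables as in the snippet above.
ierr = cudaSetDevice(tid)
IF (ierr /= cudaSuccess) THEN
  WRITE(*,*) 'cudaSetDevice failed on thread ', tid, ': ', cudaGetErrorString(ierr)
  STOP
END IF
```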


Can you point out what I’m doing wrong here?

Could calling cudaSetDevice and ACC_SET_DEVICE_NUM together be risky?


I was able to recreate the issue here, but I’ll need one of our compiler engineers to investigate what’s wrong. For reference, I’ve logged this issue as TPR#23562.

Typically using both cudaSetDevice and acc_set_device_num works fine, but it looks like we’ve never tested them together in an OpenMP code. Can you try running without the call to cudaSetDevice?
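
In case it helps, the device-selection part without cudaSetDevice would look roughly like this (a sketch; the per-device setup that follows stays as in your snippet, and cudaGetDeviceProperties is still fine since it queries by device index):

```fortran
! Sketch: select the device through OpenACC only, dropping cudaSetDevice.
!$OMP PARALLEL PRIVATE(tid, ierr, cuProperty, PinCount, PinBeg, PinEnd, maxFsr)
tid = OMP_GET_THREAD_NUM()
CALL ACC_SET_DEVICE_NUM(tid, acc_device_nvidia)   ! no cudaSetDevice call
ierr = cudaGetDeviceProperties(cuProperty, tid)   ! queries by index
! ... rest of the per-device setup as before ...
!$OMP END PARALLEL
```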

  • Mat

Thanks, Mat.

That works.

So, does acc_set_device_num() fully subsume cudaSetDevice?

In that case, is it always safe to use only acc_set_device_num()?

I just hit the same issue in PGI 16.5, using MPI (for multi-GPU on a single node) together with CUDA and OpenACC. The program seems to exit immediately after MPI startup, with no other output. There also seems to be some lingering error state from an earlier failure of the same unchanged code: from what I could observe, it would run some CUDA kernels fine, then hit an addressing error as soon as OpenACC kernels were called, then exit. All subsequent runs of the same program then fail immediately at startup with error 709 in cuDevicePrimaryCtxRetain. Interestingly, this gets printed three times when trying to use four GPUs, so presumably once for every process except the one with device ID 0.

Quick update: Updating to CUDA 8 / PGI 16.9 and then removing acc_init from the setup worked for me. After that I’m able to run OpenACC and CUDA Fortran code together on multiple GPUs using cudaSetDevice and acc_set_device_num.
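
Roughly, the per-rank device selection that works for me looks like this (a sketch; here 'rank' is assumed to come from MPI_COMM_RANK and 'nDevice' from cudaGetDeviceCount):

```fortran
! Sketch: per-rank device selection for single-node multi-GPU MPI runs.
! Assumes 'rank' from MPI_COMM_RANK; no acc_init call, per the fix above.
ierr = cudaGetDeviceCount(nDevice)
dev  = MOD(rank, nDevice)                    ! map ranks to devices round-robin
ierr = cudaSetDevice(dev)
CALL ACC_SET_DEVICE_NUM(dev, acc_device_nvidia)
```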

For anyone coming across this thread with the same issue: it should be resolved in releases after 18.10.