cudaSetDevice failing


I’m trying to use multiple GPUs on a Linux machine that has 2 GTX Titan Z cards (4 physical GPU chips).

I use 4 OpenMP threads so that I can associate each CPU thread with one GPU chip.

But I get some errors that I cannot make sense of:

Failing in Thread:4
Failing in Thread:2
Failing in Thread:3
call to cuDevicePrimaryCtxRetain returned error 709: Context is destroyed or not yet created
call to cuDevicePrimaryCtxRetain returned error 709: Context is destroyed or not yet created
call to cuDevicePrimaryCtxRetain returned error 709: Context is destroyed or not yet created
Failing in Thread:4
call to cuDevicePrimaryCtxRelease returned error 4: Deinitialized
Failing in Thread:1
call to cuModuleLoadData returned error 300: Invalid Source
Error: _mp_pcpu_reset: lost thread

Following is the code section where the error occurs.


!$OMP PARALLEL PRIVATE(tid, ierr, cuProperty, PinCount, PinBeg, PinEnd, maxFsr)
tid = OMP_GET_THREAD_NUM(); ierr = cudaSetDevice(tid)
CALL ACC_SET_DEVICE_NUM(tid, acc_device_nvidia)

ierr = cudaGetDeviceProperties(cuProperty, tid)
cuDevice(tid)%cuSMXCount = cuProperty%multiProcessorCount
cuDevice(tid)%cuArchitecture = cuProperty%major
cuDevice(tid)%cuWarpSize = cuProperty%warpSize
cuDevice(tid)%cuMaxThreadPerSMX = cuProperty%maxThreadsPerMultiprocessor
cuDevice(tid)%cuMaxThreadPerBlock = cuProperty%maxThreadsPerBlock
cuDevice(tid)%cuMaxWarpPerSMX = cuProperty%maxThreadsPerMultiprocessor / cuProperty%warpSize

SELECT CASE (cuDevice(tid)%cuArchitecture)
CASE (2)   !--- Fermi
  cuDevice(tid)%cuMaxBlockPerSMX = 8
CASE (3)   !--- Kepler
  cuDevice(tid)%cuMaxBlockPerSMX = 16
CASE (5)   !--- Maxwell
  cuDevice(tid)%cuMaxBlockPerSMX = 32
CASE (6)   !--- Pascal
  cuDevice(tid)%cuMaxBlockPerSMX = 32
END SELECT

cuDevice(tid)%cuWarpPerBlock = cuDevice(tid)%cuMaxWarpPerSMX / cuDevice(tid)%cuMaxBlockPerSMX
cuDevice(tid)%cuThreadPerBlock = cuDevice(tid)%cuWarpPerBlock * cuDevice(tid)%cuWarpSize

IF (cuDevice(tid)%lFullWarp) THEN
  cuDevice(tid)%sharedMemoryDim = cuDevice(tid)%cuThreadPerBlock
ELSE
  cuDevice(tid)%sharedMemoryDim = 2 * ng
END IF

!$ACC ENTER DATA COPYIN(cuDevice(tid))
!$ACC ENTER DATA COPYIN(cuDevice(tid)%FsrBeg, cuDevice(tid)%FsrEnd)
!$ACC ENTER DATA COPYIN(cuDevice(tid)%PinBeg, cuDevice(tid)%PinEnd)
!$ACC ENTER DATA COPYIN(cuDevice(tid)%DcmpRayList, cuDevice(tid)%DcmpRayCount)
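
Incidentally, the return codes stored in ierr are never checked, so a failure in cudaSetDevice itself would go unnoticed until a later call blows up. A minimal sketch of checking the status right after device selection (assuming the cudafor module is used, as in the snippet above):

```fortran
! Sketch: check the CUDA Fortran return code immediately after device selection.
! tid and ierr are the same variables as in the snippet above.
ierr = cudaSetDevice(tid)
IF (ierr /= cudaSuccess) THEN
  WRITE(*,*) 'cudaSetDevice failed on thread ', tid, ': ', cudaGetErrorString(ierr)
  STOP
END IF
```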


Can you point out what I’m doing wrong here?

Could calling cudaSetDevice and ACC_SET_DEVICE_NUM together be risky?


I was able to recreate the issue here, but I’ll need one of our compiler engineers to investigate what’s wrong. For reference, I’ve logged this issue as TPR#23562.

Typically using both cudaSetDevice and acc_set_device_num works fine, but it looks like we’ve never tested them together in an OpenMP code. Can you try running without the call to cudaSetDevice?
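
In case it helps, the device-selection part without cudaSetDevice would look roughly like this (a sketch; the per-device setup that follows stays as in your snippet, and cudaGetDeviceProperties is still fine since it queries by device index):

```fortran
! Sketch: select the device through OpenACC only, dropping cudaSetDevice.
!$OMP PARALLEL PRIVATE(tid, ierr, cuProperty, PinCount, PinBeg, PinEnd, maxFsr)
tid = OMP_GET_THREAD_NUM()
CALL ACC_SET_DEVICE_NUM(tid, acc_device_nvidia)   ! no cudaSetDevice call
ierr = cudaGetDeviceProperties(cuProperty, tid)   ! queries by index
! ... rest of the per-device setup as before ...
!$OMP END PARALLEL
```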

  • Mat

Thanks, Mat.

That works.

So, does acc_set_device_num() fully subsume cudaSetDevice?

In that case, is it always safe to use only acc_set_device_num()?

I just hit the same issue in PGI 16.5, using MPI (for multi-GPU on a single node) together with CUDA and OpenACC. The program seems to exit immediately after MPI startup, with no other output. There also seems to be some lingering error state from an earlier failure of the same unchanged code: from what I could observe, it would run some CUDA kernels fine, then hit an addressing error as soon as OpenACC kernels were called, then exit. All subsequent runs of the same program then fail immediately at startup with error 709 in cuDevicePrimaryCtxRetain. Interestingly, this gets printed three times when trying to use four GPUs, so presumably once for every process except the one with device ID 0.

Quick update: Updating to CUDA 8 / PGI 16.9 and then removing acc_init from the setup worked for me. After that I’m able to run OpenACC and CUDA Fortran code together on multiple GPUs using cudaSetDevice and acc_set_device_num.
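
Roughly, the per-rank device selection that works for me looks like this (a sketch; here 'rank' is assumed to come from MPI_COMM_RANK and 'nDevice' from cudaGetDeviceCount):

```fortran
! Sketch: per-rank device selection for single-node multi-GPU MPI runs.
! Assumes 'rank' from MPI_COMM_RANK; no acc_init call, per the fix above.
ierr = cudaGetDeviceCount(nDevice)
dev  = MOD(rank, nDevice)                    ! map ranks to devices round-robin
ierr = cudaSetDevice(dev)
CALL ACC_SET_DEVICE_NUM(dev, acc_device_nvidia)
```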

For anyone coming across this thread with the same issue: it should be resolved in releases after 18.10.