CUDA GPU terminates process at random instances

cmccabe · December 19, 2016, 9:32pm

I am trying to troubleshoot an error on a virtual server running Ubuntu 14.04. What happens, seemingly at random, is that the GPU stops processing and the process terminates. By seemingly random I mean that three runs complete without error and then the error appears on the fourth run. It has happened four times now, and the only consistency is that it appears to fail at the same point, cycle 21 (as indicated by the log, not included). If I reboot, the GPU starts up again and processes normally.
Are there any commands or recommendations that might help me figure out what is going on? Also, I cannot find what error code 46 refers to. Thank you :).

Error:

CUDA: gpuDeviceConfig: device added for evaluation: 0:GeForce GTX 970 v5.2 3.99982GB
CUDA: gpuDeviceConfig: minimum compute version used for pipeline: 2.0
CUDA 0: gpuDeviceConfig::initDeviceContexts: Creating Context and Constant memory on device with id: 0
terminate called after throwing an instance of 'cudaExecutionException'

+---------------------------------------
| ** CUDA ERROR! **
| Error: 46
| Msg: all CUDA-capable devices are busy or unavailable
| File: cudaWrapper.cpp
| Line: 127
+---------------------------------------
what(): CUDA EXCEPTION: Error occurred during job Execution!

Robert_Crovella · December 19, 2016, 9:53pm

Error 46 is a CUDA runtime API error, and its meaning is exactly the message reported in your output. From driver_types.h:

    /**
     * This indicates that all CUDA devices are busy or unavailable at the current
     * time. Devices are often busy/unavailable due to use of
     * ::cudaComputeModeExclusive, ::cudaComputeModeProhibited or when long
     * running CUDA kernels have filled up the GPU and are blocking new work
     * from starting. They can also be unavailable due to memory constraints
     * on a device that already has active CUDA work being performed.
     */
    cudaErrorDevicesUnavailable           =     46,

Possibly that exception (and termination) left the GPU busy.
You should find out how (and why) exactly your application is throwing that exception:

terminate called after throwing an instance of 'cudaExecutionException'

The CUDA runtime API does not define exceptions, so that exception is being generated and thrown based on some processing in your application. That might be the best way to get more clues as to what is going on.
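
To illustrate how an application might turn a CUDA status code into an exception like the one in your log, here is a minimal sketch. It is not your application's actual code: `StatusCode`, `describeStatus`, `GpuExecutionException`, `checkStatus`, and `CHECK_STATUS` are hypothetical names, and the enum is a stand-in for `cudaError_t` so that the sketch compiles without the CUDA toolkit. A real wrapper would most likely check the `cudaError_t` returned by each runtime call and use `cudaGetErrorString()` for the message.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in for cudaError_t (assumption, so the sketch builds without CUDA).
// The value 46 and its message are taken from the log in this thread.
enum StatusCode { kSuccess = 0, kDevicesUnavailable = 46 };

// Stand-in for cudaGetErrorString(); a real wrapper would call that instead.
inline const char* describeStatus(StatusCode status) {
    switch (status) {
        case kSuccess:            return "no error";
        case kDevicesUnavailable: return "all CUDA-capable devices are busy or unavailable";
    }
    return "unknown error";
}

// Hypothetical application-defined exception carrying the failing status code.
class GpuExecutionException : public std::runtime_error {
public:
    GpuExecutionException(StatusCode status, const char* file, int line)
        : std::runtime_error(buildMessage(status, file, line)), status_(status) {}

    StatusCode status() const { return status_; }

private:
    // Formats code, message, file, and line into a single what() string.
    static std::string buildMessage(StatusCode status, const char* file, int line) {
        std::ostringstream out;
        out << "Error: " << static_cast<int>(status)
            << " | Msg: " << describeStatus(status)
            << " | File: " << file
            << " | Line: " << line;
        return out.str();
    }

    StatusCode status_;
};

// Throws if the status is anything other than success.
inline void checkStatus(StatusCode status, const char* file, int line) {
    if (status != kSuccess) {
        throw GpuExecutionException(status, file, line);
    }
}

// Records the call site automatically, similar to the File/Line fields in the log.
#define CHECK_STATUS(call) checkStatus((call), __FILE__, __LINE__)
```

If nothing catches such an exception, the C++ runtime calls `std::terminate`, and with GCC's libstdc++ the default handler prints "terminate called after throwing an instance of ..." followed by the `what()` text, which is the shape of the output you posted. If you have access to the application source, the check at line 127 of cudaWrapper.cpp (per your log) should show which CUDA call returned error 46.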

Also, I'm not sure what you mean by a “virtual server”. If the server is running in a VM, then it may be that the process of placing that GPU in the VM is flaky.

cmccabe · December 20, 2016, 1:14pm

Thank you. Yes, you are correct: by “virtual server” I mean a server running in an Ubuntu 14.04 VM. Do you have any suggestions for how to start figuring out what may have caused the error?

File: /sw_results/R_2016_12_05_13_30_48_user_S5-00580-17-Medexome/X0_Y0/acq_0020.dat
[b]FileLoadWorker: ImageProcessing time for flow 21: 0.65(ld=0.39 pin=0.05 cnc=0.11 xt=0.09 sem=0.00 cache=0.06) sec 16:07:13[/b]
File: /sw_results/R_2016_12_05_13_30_48_user_S5-00580-17-Medexome/X0_Y0/acq_0021.dat
CUDA: gpuDeviceConfig: device added for evaluation: 0:GeForce GTX 970 v5.2 3.99982GB
CUDA: gpuDeviceConfig: minimum compute version used for pipeline: 2.0
CUDA 0: gpuDeviceConfig::initDeviceContexts: Creating Context and Constant memory on device with id: 0
terminate called after throwing an instance of 'cudaExecutionException'

It seems CUDA was interrupted at the step in bold (the FileLoadWorker line for flow 21) and the process then terminated. Sorry, this is all new to me, so I am not really sure where to start. Thank you :).