See the gaps in this nvprof timeline screenshot: https://i.stack.imgur.com/VovPj.png
I’ve written and am running a real-time CUDA application that uses 2 GPUs. Both GPUs execute CUDA code, but GPU0 also does OpenGL and QML rendering (with some use of shaders). 99.99% of the time this works perfectly. Very occasionally, however, both GPUs lock up at once, halting the processing/CUDA threads and the UI thread for almost exactly 1.0 second.
- What might be responsible for halting the GPUs for 1 second at a time?
- Does anyone know how I might debug this further? nvprof lets you see the CUDA activity that might be halting/occupying a GPU, but gives no indication about rendering events.
- OS: Ubuntu 14.04 with the Metacity window manager
- CUDA: currently 8.0.61; same behavior on 8.0.44, and also on CUDA 6.5
- GPU: GTX 980, also GTX 1070
- Driver: 375.39, also 375.66, 367.xx
Other steps taken:
- Ran cuda-memcheck: the memcheck, initcheck, and racecheck tools all come back clean.
- Set the fan speed to 100% to rule out thermal throttling.
- Wrote a second process that periodically runs short CUDA test kernels. That process halted at exactly the same times as the main application.
- A third non-CUDA command-line process continued running throughout the halts.
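For reference, the probe process from the bullets above was essentially the following (a minimal sketch, not the exact code; all identifiers are mine, and the 100 ms threshold is an arbitrary choice — the observed stalls were ~1000 ms, so any outlier well above normal launch latency gets flagged):

```cuda
#include <cstdio>
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Deliberately empty kernel: we only care how long the GPU takes
// to service the launch, not about doing any real work.
__global__ void probeKernel() {}

int main() {
    using clock = std::chrono::steady_clock;
    for (;;) {
        auto t0 = clock::now();
        probeKernel<<<1, 1>>>();
        // Block until the GPU has actually run the kernel; a GPU-wide
        // stall shows up here as an abnormally long wait.
        cudaError_t err = cudaDeviceSynchronize();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      clock::now() - t0).count();
        if (err != cudaSuccess)
            std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        if (ms > 100)  // arbitrary threshold; stalls in question are ~1000 ms
            std::printf("probe stalled for %lld ms\n", (long long)ms);
        std::this_thread::sleep_for(std::chrono::milliseconds(250));
    }
}
```

Because this runs in a separate process and still stalls at the same moments as the main application, the halt appears to be GPU- or driver-wide rather than something in my application's own stream/context.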