How to debug 1-second GPU lockups in a mixed CUDA and OpenGL/GLSL graphics program

See the gaps in this Nvprof screenshot: https://i.stack.imgur.com/VovPj.png

I’ve created and am running a real-time CUDA application using 2 GPUs. Both GPUs execute CUDA code, but GPU0 also does OpenGL and QML rendering (with some use of GLSL shaders). 99.99% of the time this works perfectly. Very occasionally, however, the GPUs simply lock up, halting both the processing/CUDA threads and the UI thread for almost exactly 1.0 second, and also halting any other process that is running CUDA or graphics.

I posted about this problem about a year ago; since then I have tried additional troubleshooting steps and have new information.

I have eliminated all of the cudaMalloc/cudaFree calls that previously happened on a per-frame basis. By itself this does nothing to fix the halts, although it does make the frame processing rate much faster between them.
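
For reference, the change amounts to something like this (a minimal sketch; the struct and function names are illustrative, not from my actual application): working buffers are allocated once at startup, sized for the largest expected frame, and reused every frame.

```cpp
// Minimal sketch: allocate device buffers once at startup and reuse them
// every frame, instead of the old per-frame cudaMalloc/cudaFree pairs.
// FrameBuffers, allocateOnce, and processFrame are illustrative names.
#include <cuda_runtime.h>
#include <cstddef>

struct FrameBuffers {
    void*  d_input  = nullptr;
    void*  d_output = nullptr;
    size_t capacity = 0;
};

// Called once at startup, sized for the largest expected frame.
cudaError_t allocateOnce(FrameBuffers& fb, size_t maxFrameBytes) {
    cudaError_t err = cudaMalloc(&fb.d_input, maxFrameBytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&fb.d_output, maxFrameBytes);
    if (err != cudaSuccess) return err;
    fb.capacity = maxFrameBytes;
    return cudaSuccess;
}

// The per-frame path now only reuses the persistent buffers; this is where
// the per-frame cudaMalloc/cudaFree calls used to live.
void processFrame(FrameBuffers& fb, const void* h_frame, size_t frameBytes) {
    cudaMemcpy(fb.d_input, h_frame, frameBytes, cudaMemcpyHostToDevice);
    // ... launch kernels reading fb.d_input and writing fb.d_output ...
}
```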

I’ve also tried restricting all graphics to GPU0 and all CUDA to GPU1. This eliminated the 1-second halts: in this configuration, all CUDA kernels and all frame rendering calls complete in under 200 ms.
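
The split itself is just per-thread device selection, since cudaSetDevice applies to the calling host thread. Roughly (a sketch; my real code has more threads and error checking):

```cpp
// Sketch of the device split: cudaSetDevice is per host thread, so the CUDA
// worker pins itself to GPU1 once, while the GL/QML render thread keeps its
// context on GPU0. (The thread structure here is illustrative.)
#include <cuda_runtime.h>

void cudaWorkerThread() {
    cudaSetDevice(1);  // all later CUDA calls from this thread target GPU1
    for (;;) {
        // ... per-frame memcpys and kernel launches ...
    }
}
```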

  • What might be responsible for halting the GPUs for 1 second at a time?
  • Does anyone know how I might debug this further? nvprof lets you see the CUDA activity that might be halting or occupying a GPU, but it gives no indication of rendering events. (A sketch of in-process timing that helps separate the two cases follows this list.)
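
The in-process timing I mean is along these lines (a sketch; the 500 ms threshold and the helper name are assumptions, not from my actual code). A large host-side round trip combined with a small GPU-side elapsed time suggests the launch was blocked rather than the kernel running long:

```cpp
// Sketch of in-process stall logging: bracket each kernel launch with CUDA
// events plus host-side wall-clock timestamps. The 500 ms threshold and the
// function name are assumptions for illustration.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

void timedLaunch() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    auto hostBegin = std::chrono::steady_clock::now();
    cudaEventRecord(start);
    // myKernel<<<grid, block>>>(...);   // the launch under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    auto hostEnd = std::chrono::steady_clock::now();

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);
    double hostMs =
        std::chrono::duration<double, std::milli>(hostEnd - hostBegin).count();

    // Large hostMs with small gpuMs: the launch was blocked (device busy or
    // held by something else) rather than the kernel itself running long.
    if (hostMs > 500.0)
        std::fprintf(stderr, "stall: host %.1f ms, gpu %.1f ms\n", hostMs, gpuMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```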

Additional info:

  • OS: Ubuntu 14.04 with the Metacity window manager
  • CUDA: currently 8.0.61; same behavior on 8.0.44 and on CUDA 6.5
  • GPU: GTX 980; also GTX 1070 and Quadro P4000
  • Driver: 375.39; also 375.66, 367.xx, 384.xx, and 396.xx
  • There is no CUDA/OpenGL interop in my program

Other steps taken:

  • Ran cuda-memcheck; the memcheck, initcheck, and racecheck tools all come back clean.
  • Set the fan speed to 100% to ensure there was no thermal throttling.
  • Compiled a second process that periodically runs short CUDA test kernels (a reconstructed sketch of this probe follows this list). That process halted at exactly the same times as the main application.
  • A third non-CUDA command-line process continued running throughout the halts.
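
Roughly what that probe process looked like (a reconstructed sketch, not the exact code; the 100 ms threshold, 50 ms period, and device index are illustrative):

```cpp
// Reconstructed sketch of the probe process: launch a trivial kernel in a
// loop and report whenever the round trip takes far longer than usual.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>

__global__ void probeKernel(int* out) { *out = 1; }

int main() {
    cudaSetDevice(0);            // probe one GPU; run one copy per device
    int* d_flag = nullptr;
    cudaMalloc(&d_flag, sizeof(int));

    for (;;) {
        auto t0 = std::chrono::steady_clock::now();
        probeKernel<<<1, 1>>>(d_flag);
        cudaDeviceSynchronize();
        auto t1 = std::chrono::steady_clock::now();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > 100.0)          // normal round trips are well under this
            std::printf("GPU stall: probe took %.0f ms\n", ms);

        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}
```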