Compute performance degradation except when connected via Remote Desktop

I’m experiencing slow performance in an application that uses CUDA for all sorts of image processing operations (some functions from the OpenCV cuda modules plus a few kernels I coded myself). However, when I connect to the PC that runs the application through Windows Remote Desktop instead of sitting in front of its screen, and restart the application, performance is correct.

I can share a small application + source code that illustrates the slow-down. The app executes a sequence of CUDA operations in a loop: upload an image to the device, perform various dummy image processing operations, and download the result back to the host. It measures the average execution time over ~100 iterations.

  • running while in front of the physical screen: 43 ms / iteration
  • running while connected through remote desktop: 33 ms / iteration
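
For reference, the measurement loop in the test app looks roughly like the sketch below (the Gaussian blur and threshold are only placeholders for the actual OpenCV cuda calls and custom kernels; the image size and parameters are made up):

    #include <chrono>
    #include <iostream>
    #include <opencv2/core.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/cudaarithm.hpp>
    #include <opencv2/cudafilters.hpp>

    int main()
    {
        cv::Mat src(1080, 1920, CV_8UC1, cv::Scalar(128));   // placeholder input image
        cv::Mat dst;
        cv::cuda::GpuMat d_src, d_tmp, d_dst;

        // Dummy processing chain standing in for the real OpenCV cuda calls + custom kernels
        cv::Ptr<cv::cuda::Filter> blur =
            cv::cuda::createGaussianFilter(CV_8UC1, CV_8UC1, cv::Size(5, 5), 1.5);

        const int iterations = 100;
        auto t0 = std::chrono::steady_clock::now();

        for (int i = 0; i < iterations; ++i)
        {
            d_src.upload(src);                  // host -> device
            blur->apply(d_src, d_tmp);          // dummy image processing
            cv::cuda::threshold(d_tmp, d_dst, 100, 255, cv::THRESH_BINARY);
            d_dst.download(dst);                // device -> host
        }

        auto t1 = std::chrono::steady_clock::now();
        double avgMs =
            std::chrono::duration<double, std::milli>(t1 - t0).count() / iterations;
        std::cout << "average: " << avgMs << " ms / iteration\n";
        return 0;
    }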

Any ideas would be appreciated!

Build environment details:

  • C++ application built on Windows 10 with VS2017, using CUDA Toolkit 10.1 Update 2
  • Code generation setup in the VS project (for my own functions): compute_50,sm_50;compute_60,sm_61
  • Linking against OpenCV 3.4.2 built with the same VS and CUDA Toolkit:
    • OpenCV cmake settings:
      • CUDA_ARCH_BIN=5.0 6.1 7.5
      • CUDA_ARCH_PTX=7.0

Environment in which the application must run:

  • Windows 7 x64
  • Quadro P4000
  • (there is also integrated graphics from Intel present)
  • driver tested: 431.86
  • screen connected through an HDMI cable to the Quadro P4000
    OR
    • doing a Windows Remote Desktop from a nearby PC

I would guess it is the card updating the video image that is causing your slowdown. When you use Remote Desktop there are no graphics for the card to update, so you don’t see the slowdown.

This is just a guess, but it should be verifiable. Try driving the display from the integrated graphics instead of the P4000; if this is the cause, you should see the faster performance.

We did actually test using the integrated graphics for the display in the past (although with a different version of the software, built against CUDA 9.0 instead of 10.1 and an older NVIDIA driver), but counter-intuitively, compute was slightly faster when the display was connected to the P4000. Maybe the P4000 was still the one rendering the image to display, and then feeding it back to the integrated GPU?
We’ll have to perform more tests, with and without the integrated graphics, and also on Win10.
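
For those tests, something like the following minimal sketch (plain CUDA runtime API) can confirm which device the compute actually lands on, independent of which adapter drives the display; the Intel integrated GPU is not a CUDA device, so it won’t appear in the list:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        // List every CUDA-capable device visible to the runtime
        for (int i = 0; i < count; ++i)
        {
            cudaDeviceProp prop{};
            cudaGetDeviceProperties(&prop, i);
            std::printf("CUDA device %d: %s (compute %d.%d)\n",
                        i, prop.name, prop.major, prop.minor);
        }

        int current = -1;
        cudaGetDevice(&current);   // device the runtime will use by default
        std::printf("default device: %d\n", current);
        return 0;
    }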

I did more tests on a group of Win10 PCs, and the issue “miraculously” disappears there. Since we were going to migrate everything to Win10 in the near future anyway, I guess this fixes our issue.