I'm seeing slow performance in an application that uses CUDA for a range of image processing operations (some functions from the OpenCV CUDA modules plus a few kernels that I wrote myself). However, when I connect to the PC that runs the application through Windows Remote Desktop, instead of sitting in front of its screen, and restart the application, performance is as expected.
I can share a small application with source code that reproduces the slowdown. The app executes a sequence of CUDA operations in a loop: upload an image to the device, perform various dummy image processing operations, and download the result image back to the host. It measures the average execution time over ~100 iterations:
- running while in front of the physical screen: 43 ms / iteration
- running while connected through remote desktop: 33 ms / iteration
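A minimal sketch of this kind of timing loop, in case it helps anyone reproduce the measurement. It is not the actual test app: `time_iterations`, `TimingStats`, and the `run_iteration` callable (standing in for upload + processing + download) are hypothetical names, and the sketch assumes the callable blocks until the GPU work has finished. It also reports the min and max per iteration, since the average alone can hide occasional stalls.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Per-iteration timing summary, in milliseconds.
struct TimingStats {
    double mean_ms;
    double min_ms;
    double max_ms;
};

// Times `iterations` calls of `run_iteration` after `warmup` untimed calls.
// `run_iteration` is a hypothetical stand-in for upload + processing + download.
// Assumption: it blocks until the GPU work is done (for example, it ends with a
// synchronous download); otherwise only the launch overhead would be measured.
TimingStats time_iterations(const std::function<void()>& run_iteration,
                            int iterations = 100, int warmup = 10) {
    assert(iterations > 0);

    // Untimed warm-up calls, so one-time initialization costs are excluded.
    for (int i = 0; i < warmup; ++i) run_iteration();

    using clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    double min_ms = std::numeric_limits<double>::max();
    double max_ms = 0.0;

    for (int i = 0; i < iterations; ++i) {
        const auto t0 = clock::now();
        run_iteration();
        const auto t1 = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        total_ms += ms;
        min_ms = std::min(min_ms, ms);
        max_ms = std::max(max_ms, ms);
    }
    return {total_ms / iterations, min_ms, max_ms};
}
```

Calling `time_iterations(fn)` with the real per-frame function in both setups (physical screen vs. remote desktop) would show whether the extra ~10 ms is spread evenly over all iterations or comes from a few slow ones.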
Any ideas would be appreciated!
Build environment details:
- C++ application built on Windows 10 with VS2017, using CUDA Toolkit 10.1 Update 2
- Code generation setup in the VS project (for my own functions): compute_50,sm_50;compute_60,sm_61
- Linked against OpenCV 3.4.2, built with the same VS and CUDA Toolkit
- OpenCV cmake settings:
  - CUDA_ARCH_BIN=5.0 6.1 7.5
  - CUDA_ARCH_PTX=7.0
Environment in which the application must run:
- Windows 7 x64
- Quadro P4000
- (there is also integrated graphics from Intel present)
- driver tested: 431.86
- display: either a screen connected through an HDMI cable to the Quadro P4000, or a Windows Remote Desktop session from a nearby PC