Running OpenSceneGraph / OpenGL previews on Quadro cards - Heavy CPU peaks

Hello everyone,

Thank you for your time and attention!

I am a software developer working on an application that includes multiple built-in, real-time OpenSceneGraph / OpenGL previews. In recent months, several customers have reported that their CPUs stall heavily when using our software. The affected computers always have specifications similar to the following:

  • Intel multi-core CPU (e.g. Xeon, Core i7)
  • NVIDIA Quadro K600 / K620 / FX 770M / 1000M / 2000M
  • Windows 7 Pro x64 / Windows 7 Ultimate x64 / Windows 10 Pro
  • Various NVIDIA drivers in the range of 353.xx to 354.xx

Thus, we have set up the following test system:

  • Intel Core i7-3930K
  • NVIDIA Quadro K600 by PNY
  • Windows 7 Ultimate x64
  • NVIDIA driver 354.56

On this system, the symptoms are as follows:

  • When running our software with previews enabled, the CPU usage sporadically spikes to 100 % on all cores for about 1 to 10 seconds.
  • During these spikes, our software as well as most other processes are unresponsive.
  • Apart from these peaks, the CPU usage is below 10 %.
  • The interval between peaks varies from about 15 minutes to 20 hours, depending on the test case / software configuration / system configuration (please see the details below).
  • The following is a screenshot of the CPU usage history right after such a peak: https://www.dropbox.com/s/h6pnbq3fgkc5uoq/Performance_Graph.png?dl=0
  • Furthermore, Process Explorer shows that during such a peak the entire CPU usage is split among several threads of our application.

I searched for references to similar problems in the forums of NVIDIA, OSG, and OpenMP, and followed some sparse hints without success (keywords: “CPU 100%”, “Quadro” in conjunction with “OpenMP”).

In order to track down the origin of the issue, I did a number of tests. Here are the main test results:

Mandatory circumstances to reproduce the problem:

  1. Rendering with the Quadro card
    • When the graphics card in the above-mentioned test system is replaced with a GeForce GTX 670 (driver 361.43), the problem disappears.
    • There are neither reports from our team nor from our customers about similar problems with other graphics cards.
  2. Using only release builds
    • Starting debug builds, or release builds with the debugger attached, does not reproduce the problem (I am using Visual Studio Professional 2013 Update 5).
  3. At least one OpenGL context must have been created
    • Our software consists of multiple tiers, ultimately linking against OpenSceneGraph 3.0.1 and Qt 4.8.7.
    • The previews are basically osgQt::GLWidgets driven by osgQt::GraphicsWindowQt. After creation, each GLWidget is embedded in a common QWidget (see the setup sketch after this list).
    • Furthermore, we use osgViewer::CompositeViewer in single-threaded mode.
      • If I comment out the creation of the GraphicsWindowQt (hence skipping OpenGL context creation and deactivating the previews at compile time), the problem does not show up (60+ hour test case).
      • In contrast, the previews can also be disabled at run time, whereby already initialized OSG resources (and presumably OpenGL contexts) are cleaned up again. In this case, the peaks still occur, although less frequently.
      • I had other test cases where the previews / contexts were only created but not triggered continuously, and the problem still showed up frequently.
      • In order to exclude OSG and Qt, I replaced the preview programmatically with a dummy window using wglCreateContext() that just displays one gluSphere(), but the peaks still appeared (once after 8.5 hours); a minimal sketch of that dummy setup follows below.
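For completeness, here is the rough shape of that dummy reproducer (a minimal, simplified sketch of what I tested, not the verbatim code; it assumes linking against opengl32.lib and glu32.lib):

// Minimal dummy window with a plain WGL context, continuously drawing one gluSphere().
// Simplified sketch: no error handling and no projection setup; the point is only to
// create a context and keep it rendering at roughly the preview rate.
#include <windows.h>
#include <GL/gl.h>
#include <GL/glu.h>

int main()
{
    // Reuse the built-in "STATIC" window class to keep the sketch short.
    HWND hwnd = CreateWindowA("STATIC", "Dummy GL", WS_OVERLAPPEDWINDOW | WS_VISIBLE,
                              0, 0, 640, 480, nullptr, nullptr, nullptr, nullptr);
    HDC hdc = GetDC(hwnd);

    PIXELFORMATDESCRIPTOR pfd = {};
    pfd.nSize      = sizeof(pfd);
    pfd.nVersion   = 1;
    pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
    pfd.iPixelType = PFD_TYPE_RGBA;
    pfd.cColorBits = 32;
    pfd.cDepthBits = 24;
    SetPixelFormat(hdc, ChoosePixelFormat(hdc, &pfd), &pfd);

    HGLRC context = wglCreateContext(hdc);   // the context creation under test
    wglMakeCurrent(hdc, context);

    GLUquadric* quad = gluNewQuadric();
    for (;;)
    {
        MSG msg;
        while (PeekMessage(&msg, nullptr, 0, 0, PM_REMOVE))
        {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        gluSphere(quad, 0.5, 32, 32);
        SwapBuffers(hdc);
        Sleep(20);                           // roughly 50 FPS, matching the preview update rate
    }
}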

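For reference, the regular (non-dummy) preview and context creation follows the standard osgQt pattern, roughly as sketched below (simplified; the layout parameter and sizes are illustrative, not our production code):

// Rough sketch of one preview: an osgQt::GraphicsWindowQt creates the OpenGL context,
// its GLWidget is embedded in the common parent widget, and the view is added to the
// single-threaded CompositeViewer.
#include <osgQt/GraphicsWindowQt>
#include <osgViewer/CompositeViewer>
#include <osgViewer/View>
#include <osg/Viewport>
#include <QVBoxLayout>

void addPreview(osgViewer::CompositeViewer& viewer, QVBoxLayout* parentLayout)
{
    osg::ref_ptr<osg::GraphicsContext::Traits> traits = new osg::GraphicsContext::Traits;
    traits->width = 640;
    traits->height = 480;
    traits->windowDecoration = false;
    traits->doubleBuffer = true;

    // OpenGL context creation happens here.
    osgQt::GraphicsWindowQt* gw = new osgQt::GraphicsWindowQt(traits.get());

    osg::ref_ptr<osgViewer::View> view = new osgViewer::View;
    view->getCamera()->setGraphicsContext(gw);
    view->getCamera()->setViewport(new osg::Viewport(0, 0, traits->width, traits->height));
    viewer.addView(view.get());

    // Embed the GLWidget in the common QWidget hierarchy.
    parentLayout->addWidget(gw->getGLWidget());
}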
Question 1: Does this behaviour indicate an issue with the OpenGL context creation itself, or something beyond it?

Circumstances to increase the frequency of the problem:

  1. Increasing overall multi-threading
    • Apart from the previews, our software makes heavy use of CPU-side task parallelism as well as data parallelism (for color calculations, networking, etc.). Therefore, we create our own worker threads and use OpenMP (#pragma omp parallel for); a simplified sketch follows below.
      • In the tests, it turned out that increasing the thread count, the workload, and especially the frame rate of the threads makes the problem more reproducible (intervals down to less than 15 minutes).
      • In contrast, deactivating OpenMP via compiler option even made the problem disappear (26 hour test case).
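The data-parallel sections are of this general shape (an illustrative sketch only; the actual worker code differs):

// Illustrative sketch of a data-parallel color calculation handed to OpenMP; each iteration
// is independent, so the loop is split across all cores.
#include <omp.h>
#include <vector>
#include <cstdint>

void applyGain(std::vector<std::uint8_t>& pixels, float gain)
{
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(pixels.size()); ++i)
    {
        const float value = pixels[i] * gain;
        pixels[i] = static_cast<std::uint8_t>(value > 255.0f ? 255.0f : value);
    }
}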

Question 2: Apart from ordinary run-time interactions, I cannot see any dependency between our use of multi-threading and the presence of a Quadro card. Am I missing any known issues concerning Quadro cards and, e.g., OpenMP?

Circumstances not influencing the problem:

  1. Using osgViewer::CompositeViewer in the various multi-threaded modes does not change the situation (the threading-model switch is sketched below).
  2. I installed the NVIDIA drivers from scratch and usually tested with the base profile's default settings, but I also played with some specific options in the Control Panel. In particular, I tried the Threaded Optimization option with both On and Off, but the peaks always occurred after a short time.
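For reference, the threading-model switch from point 1 boils down to the following (OSG 3.0.1 API; the helper function is just an illustration):

// Illustration of how the CompositeViewer threading model is selected; SingleThreaded is
// our default, and the other models were tried without any effect on the peaks.
#include <osgViewer/CompositeViewer>

void setViewerThreading(osgViewer::CompositeViewer& viewer, bool singleThreaded)
{
    viewer.setThreadingModel(singleThreaded
        ? osgViewer::ViewerBase::SingleThreaded
        : osgViewer::ViewerBase::DrawThreadPerContext);
}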

Question 3: Do you have any thoughts on the above-mentioned observations, suggestions for further test cases to single out specific factors, or ideally a fix for the problem?

Any help is highly appreciated.
Thank you very much!

Kind regards,
Matthias

Thanks for the detailed description.

If you have already tried the Threaded Optimization setting in the NVIDIA Control Panel, that covers one thing I would have recommended experimenting with.

Another thing to try would be a different application profile.
If your OpenGL application generates a lot of dynamic load and/or needs to keep some continuous baseline performance, please first try whether the profile “Workstation App - Dynamic Streaming” changes the behavior. It is explicitly meant to address these use cases.

Hello,

Thank you for the fast and precise reply! It really pointed me in a new direction.

First, I double-checked the Threaded Optimization options, but the peaks always occurred after a short time, as mentioned before.

Second, I tested the profile “Workstation App - Dynamic Streaming”, which has eliminated the peaks on our test system in two 20+ hour test cases so far. We will run further tests and also ask our customers to try these settings. If there is any feedback, I will post more results here.

On our test system, apart from eliminating the peaks, said profile interestingly changed several other CPU usage characteristics in a positive way:

  • The overall CPU usage decreased from around 10 % to around 5 %.
  • In total, the CPU usage seems to be steadier. Although the workload is distributed more unevenly amongst the cores, the usage history of each single core is more constant. Previously, I had seen these characteristics only in debug builds, where the peaks did not show up. This is another positive sign, I assume.
  • We can measure the run-time behaviour of specific threads in our application. It turned out that they run more constantly, too.

While these improvements are appreciated, they do raise a couple of further questions:

What exactly do you mean by “generates a lot of dynamic load”?

  • In our previews, the texture contents are continuously updated (e.g. at 50 FPS). In contrast, the texture sizes and the geometries are changed on demand only, which rarely happens. (A simplified sketch of such an update follows after these questions.)
  1. Is this part of the targeted use case of the profile?
  2. Could you reference any best practices to reduce the dynamic load in such a case?
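For context, the per-frame texture update mentioned above is essentially of the following shape (a simplified sketch; the data source and names are illustrative, not our production code):

// Simplified sketch of the continuous preview texture update (~50 FPS). The image keeps its
// size and pixel format; only the contents change, then OSG re-uploads the texture.
#include <osg/Image>
#include <cstring>

void updatePreviewImage(osg::Image* image, const unsigned char* newPixels, size_t byteCount)
{
    std::memcpy(image->data(), newPixels, byteCount);  // overwrite the pixel data in place
    image->dirty();                                    // triggers the GPU upload on the next draw
}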

How can we ensure that the profile is always automatically applied to our software?

  • For short periods, I could observe the previous, inferior CPU usage characteristics (except the peaks) when other applications (NVIDIA Control Panel, Explorer windows, LibreOffice) ran in parallel with our software. This happened only rarely and is not reproducible for me.
  • I have searched the NVIDIA Control Panel help on these topics, but it left some details unanswered. When managing the 3D settings in the Control Panel, I expected the global presets to configure the settings below them in a specific way. But when I compare the settings of the base profile with the dynamic streaming profile, there are no visible changes, except that more settings are disabled.
  1. Does the disabled state mean that these settings are set to a fixed value in order to comply with the profile?
  2. Could you clarify how the customized program settings interact with the global presets, and with each other, if programs with different settings run in parallel?
  3. Do all graphics cards from the Quadro series expose this profile?

If these questions go beyond the scope of the OpenGL forum, please feel free to redirect me to the appropriate place.

Thank you for your support!

Kind regards,
Matthias

By dynamic load I meant more “game-like” content with dynamically streamed geometry, hence the profile’s name. It is similar to how you declare your usage for VBO buffers in OpenGL, e.g. GL_STREAM_DRAW for geometry.
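For illustration, that usage declaration is the last argument of glBufferData; a dynamically streamed vertex buffer would typically be set up like this (a sketch; it assumes an extension loader such as GLEW is already initialized):

// Declaring vertex data as dynamically streamed: GL_STREAM_DRAW tells the driver the
// buffer contents are respecified roughly every frame.
#include <GL/glew.h>
#include <vector>

GLuint uploadStreamedVertices(const std::vector<float>& vertices)
{
    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertices.size() * sizeof(float),
                 vertices.data(), GL_STREAM_DRAW);
    return vbo;
}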

This is more game-like content and what GeForce boards handle well at default settings. The profile “3D App - Visual Simulation” should behave similarly, as games are basically visual simulations.

Constantly updating texture data could also be something completely different, though.
Or it could be the combination of doing OpenSceneGraph OpenGL rendering and texture uploads at the same time.

There are multiple ways to upload textures, and you can easily hit a slow path that requires the driver to convert the data on the CPU before transferring it to the GPU. Without a reproducer, it’s unclear how that would affect the CPU load in your very multi-threaded case.
It’s also unclear why a GeForce wouldn’t show that; there are even some data-transfer-related hardware differences between the GPUs.
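As an illustration of what a format-matched update looks like: for 8-bit data, GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV into a GL_RGBA8 texture is the combination commonly documented as avoiding such a CPU-side conversion (a sketch; it assumes an extension loader such as GLEW, and 'pixels' holds BGRA data):

// Respecify only the texture contents; the storage was allocated once beforehand as GL_RGBA8.
// Matching the format/type with the internal format helps the driver avoid converting on the CPU.
#include <GL/glew.h>

void updateTexture(GLuint tex, int width, int height, const unsigned char* pixels)
{
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
}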

It could also well be that there is some VRAM exhaustion and the OS starts to move data around, but 100 % load on all CPU threads for multiple seconds during that means that either everything really is busy, something is polling, or thread scheduling isn’t working efficiently.

What’s the VRAM load of your application?
The listed Quadro boards are small compared to what I’m working with.
Have you ever seen this behavior on a board with more than 2 GB of VRAM, which seems to be the maximum in your list?

You can run nvidia-smi.exe to dump the available VRAM regularly. Check its command-line parameter help.

From your description it’s not possible to say what’s going on there without a solid reproducer.
You could file a bug report via the links at the bottom of this page: http://www.nvidia.com/page/support.html

“Do all graphics cards from the Quadro series expose this profile?”

Yes, all 3D workstation class Quadro boards have this setting.

If the Workstation App - Dynamic Streaming profile works for you, then yes, it should be possible for you to programmatically enable that profile inside your application, or maybe create your own profile based on it, via NVAPI. More information here: https://developer.nvidia.com/nvapi
Check the documentation for the “Driver Settings” related module.
(I won’t be able to answer any questions about the NVAPI; it’s not my area of expertise.)

Hello,

Thank you for your clarifications and the further references.

By now, we have received positive feedback from two of our customers:

  • The CPU peaks did not appear again.
  • The overall performance is very stable.

Furthermore, we can confirm your expectations about similar profiles. After testing the most relevant profiles, the results are as follows:

Profiles which did not show peaks:

  • "Workstation App - Dynamic Streaming"
  • "3D App - Visual Simulation"
  • "3D App - Game Development"

Profiles which did show peaks:

  • "Workstation App - Advanced Streaming"
  • "3D App - Default Global Settings"
  • Base profile

“Have you ever seen this behavior on a board with more than 2 GB of VRAM, which seems to be the maximum in your list?”

No, there are no reports of this behaviour for Quadro cards with more than 2 GB of video memory, so we cannot confirm this for the moment.

Besides that, we followed your advice to dump the GPU memory usage with nvidia-smi.exe. At the same time, we profiled several other parameters with the following command line:

nvidia-smi.exe --query-gpu="timestamp,pstate,memory.used,memory.total,utilization.memory,utilization.gpu,clocks.current.memory,clocks.current.sm,clocks.current.graphics,temperature.gpu,fan.speed" --format=csv --filename="GPUStats.csv"

When using the Quadro K600 with the base profile, the main observations are:

  • When the CPU peaks occur, the nvidia-smi.exe process stalls as well and provides no measurements for the above-mentioned parameters.
  • Apart from the peaks, the parameters do not show any abnormalities; e.g. the memory usage is around 200 of 1024 MiB.

Currently, the profile change solves the issue well for us. We will check whether NVAPI is a convenient way to configure our preferred profile settings programmatically; a rough sketch of what we have in mind is below.
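This is an untested sketch based on the NVAPI Driver Settings (DRS) documentation; the function names come from nvapi.h / NvApiDriverSettings.h, but the exact profile name string and the struct fields still need to be verified on our side:

// Untested sketch: associate our executable with an existing driver profile via NVAPI DRS.
// The profile name passed in is an assumption and may have to be looked up, e.g. via
// NvAPI_DRS_EnumProfiles; error handling is reduced to early returns.
#include <nvapi.h>
#include <NvApiDriverSettings.h>
#include <cstring>
#include <cwchar>

bool attachExeToProfile(const wchar_t* profileName, const wchar_t* exeName)
{
    if (NvAPI_Initialize() != NVAPI_OK)
        return false;

    NvDRSSessionHandle session = 0;
    if (NvAPI_DRS_CreateSession(&session) != NVAPI_OK)
        return false;
    NvAPI_DRS_LoadSettings(session);                          // read the current driver settings

    NvAPI_UnicodeString name = {};
    std::memcpy(name, profileName, (wcslen(profileName) + 1) * sizeof(wchar_t));

    NvDRSProfileHandle profile = 0;
    bool ok = false;
    if (NvAPI_DRS_FindProfileByName(session, name, &profile) == NVAPI_OK)
    {
        NVDRS_APPLICATION app = {};
        app.version = NVDRS_APPLICATION_VER;
        std::memcpy(app.appName, exeName, (wcslen(exeName) + 1) * sizeof(wchar_t));
        ok = NvAPI_DRS_CreateApplication(session, profile, &app) == NVAPI_OK   // register our .exe
          && NvAPI_DRS_SaveSettings(session) == NVAPI_OK;                      // persist the change
    }
    NvAPI_DRS_DestroySession(session);
    return ok;
}

// Possible usage, e.g. from an installer step (names are placeholders):
//   attachExeToProfile(L"Workstation App - Dynamic Streaming", L"OurPreviewApp.exe");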

Kind regards,
Matthias