I am developing an application for use in a vision research lab. I’ve posted to the Gaming forums because that seems to be the most related thing to what we do.
In short, our lab generates simple visual stimuli in real time. These usually consist of one or two thousand dots drawn to the screen and arranged in simple geometric configurations; that should be trivial to render in comparison to a high-end 3D computer game. Image computations are performed using Matlab (I know, it’s slower, but it supports the kind of rapid code development that we need), while an open source API called PsychToolbox (written in C) translates our computations into OpenGL commands that draw to the back buffer, handles buffer swaps, and handles all X11 interfacing. For us, the most important thing is being able to control the timing of each image and knowing when each frame was drawn to the screen. It is therefore a serious problem that we frequently get ugly, stuttery frame skips that can last for hundreds to thousands of milliseconds. These occur stochastically; some days it’s worse, some days it’s better.
Our setup is as follows:
Ubuntu 14.04 with 4.4.0 lowlatency linux kernel
Two processors with 8 cores at 2.6GHz each
Many gigabytes of RAM and a large solid state hard drive
NVIDIA driver 384.111
LXDE desktop environment
3 Monitors connected: DVI-I to VGA adaptor feeds CRT monitor, DisplayPort 1 feeds Acer V196L, DisplayPort 2 feeds NEC MultiSync EA294WMi
3 X screens configured, one X screen per monitor
I seem to be failing to find a way to attach my nvidia bug report to this topic, but I do have it on hand if it is needed.
The CRT displays our visual stimuli, while the other two monitors are controller terminals for the experimenter to use. The two flat screens fed by the DisplayPort connectors run at 60Hz. The CRT runs at 85Hz; the higher refresh rate is needed for scientific reasons. All three monitors use different resolutions.
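To make "frame skip" concrete, here is a minimal sketch of how skips could be counted from a list of buffer-swap timestamps, assuming the 85Hz CRT described above. The function name, the tolerance value, and the idea of exporting timestamps from the application (for example those returned by PsychToolbox's Screen('Flip'), which is my assumption about where they would come from) are illustrative, not part of our actual code.

```python
def find_frame_skips(flip_times, refresh_hz=85.0, tolerance=0.5):
    """Return (index, interval_ms, missed_frames) for every late flip.

    flip_times: buffer-swap timestamps in seconds, in ascending order.
    tolerance:  fraction of one refresh period by which an interval may
                exceed the nominal period before it counts as a skip
                (hypothetical default, tune as needed).
    """
    period = 1.0 / refresh_hz
    skips = []
    for i in range(1, len(flip_times)):
        interval = flip_times[i] - flip_times[i - 1]
        if interval > period * (1.0 + tolerance):
            # Number of refresh cycles that passed without a new frame.
            missed = round(interval / period) - 1
            skips.append((i, interval * 1000.0, missed))
    return skips


# Example: the fourth flip arrives three refresh periods after the third.
example_times = [0 / 85, 1 / 85, 2 / 85, 5 / 85, 6 / 85]
example_skips = find_frame_skips(example_times)
```

Logging the skip indices together with wall-clock time would make it possible to line skips up against the thread activity described below.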
I have tried everything I can think of at my IT skill level without risking a complete OS reinstallation; in the past, I’ve broken the system completely by trying to install the latest NVIDIA driver from scratch. I have matched the refresh rates of the monitors. I’ve tried adding ForceFullCompositionPipeline to the xorg.conf file. I’ve tried switching off vertical syncing for the flat screen monitors in nvidia-settings. I’ve tried version 340.XX of the NVIDIA driver. None of these things fix the problem. Please note that I’m no IT wizard and no great expert on configuring Linux, so I have relied on Software & Updates > Additional Drivers to switch driver versions; this provides an incomplete list. I did try to switch to the open source Nouveau driver, but the package manager appears to botch the installation, because the nouveau module fails to load.
I get some hints by monitoring the threads of my application using the terminal command $ top -H -d 0.5 -p <PID>, where <PID> is my application's process ID. This reveals a labelled real-time thread created by PsychToolbox called “PTB mainthread”. Now, it bears mentioning that the lead developer of PsychToolbox has told me directly that there should only ever be one “PTB mainthread”. However, on our system, there are many. One of them has a relatively high %CPU usage (presumably the original PTB mainthread), and the rest usually have a low usage of about 4 to 6%. But when an ugly frame skip occurs, these additional “PTB mainthreads” show a sudden spike in %CPU activity. This is in stark contrast to what happens when I run my application on a much less powerful system that uses a cheapo NVS 315. There, the visual stimuli never experience such ugly frame skips and there is just one “PTB mainthread”. The lead PsychToolbox developer suggested that the extra “PTB mainthreads” that appear with the Quadro K2200 must be created by the NVIDIA driver.
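For a record that can be logged rather than watched, the same per-thread information that top -H displays can be read from /proc. The following is only a sketch, assuming a Linux system; the function names are my own, and the field positions follow the /proc/[pid]/stat layout (utime and stime are fields 14 and 15).

```python
import os


def parse_thread_stat(stat_line):
    """Extract (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The name sits in parentheses and may itself contain spaces.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    name = stat_line[open_paren + 1:close_paren]
    fields = stat_line[close_paren + 1:].split()
    # fields[0] is field 3 (state), so utime/stime (fields 14/15) are at 11/12.
    return name, int(fields[11]) + int(fields[12])


def snapshot_threads(pid):
    """Map thread ID -> (name, cumulative CPU ticks) for every thread of pid."""
    threads = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/stat") as stat_file:
                threads[int(tid)] = parse_thread_stat(stat_file.read())
        except FileNotFoundError:
            pass  # the thread exited while we were iterating
    return threads


def thread_cpu_percent(before, after, interval_s, ticks_per_second):
    """Per-thread %CPU between two snapshots taken interval_s seconds apart."""
    usage = {}
    for tid, (name, ticks) in after.items():
        # A thread that is new in `after` started from zero ticks.
        delta = ticks - before.get(tid, (name, 0))[1]
        usage[tid] = (name, 100.0 * delta / (ticks_per_second * interval_s))
    return usage


# Synthetic stat line, for illustration only (not taken from our system).
sample_line = "1234 (PTB mainthread) S 1 1234 1234 0 -1 4194304 100 0 0 0 250 50 0 0 -2 0 5 0 100 0 0"
sample_name, sample_ticks = parse_thread_stat(sample_line)
```

In use, one would call snapshot_threads twice with a short sleep in between and pass os.sysconf("SC_CLK_TCK") as ticks_per_second. Writing the results to a file with timestamps would show whether the spikes in the extra “PTB mainthreads” coincide with the frame skips.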
If that’s the case, then it looks to me like a classic multithreading problem. One thread grabs a mutex or something similar and won’t release it when it is supposed to. The remaining threads then spin until the resource is eventually released. In the meantime, anything that’s blocking on the driver will hang. Indeed, my application becomes unresponsive during ugly frame skips because it is blocking on the completion of the next buffer swap.
We are certainly not challenging the hardware. The %GPU usage never exceeds about 35%, the K2200 memory usage never exceeds about 12%, and the PCIe bandwidth never exceeds about 2%. Meanwhile, the system total %CPU hovers around a cool 12%; as far as I can tell using Linux command line tools, we never fill up our many gigabytes of RAM. In principle, this system should be massively over-engineered for our application, and frame skips should be rare and short.
NVIDIA support suggests that while a driver bug is not impossible, this is probably a developer error. So, the question is how to find out if it is one way or another. Is there a way to monitor the driver threads directly and unambiguously? Is there a way to monitor each layer of code separately (Matlab <–> PsychToolbox <–> NVIDIA driver)?
Another thing I might try is systematically switching to each available NVIDIA driver to see how it performs. I notice that there is a range of Ubuntu NVIDIA driver packages from 340.XX up to 384.XX. However, I do not know how to switch to them safely other than with the Ubuntu Software & Updates GUI, which lists only 340.XX and 384.XX. The Nouveau driver might be nice to try, if only Software & Updates would install it properly. It is possible that some package in my distribution of Ubuntu conflicts with the NVIDIA driver; perhaps a Linux kernel driver or some X11 setting? I can offer some anecdotal evidence: switching to a failed Nouveau installation, purging the remaining NVIDIA packages, then switching back to driver 384.111 appears to have reduced, but not eliminated, the probability of an ugly skip.
Otherwise I need to try different hardware. I am waiting for a second graphics card to arrive that will be dedicated to either stimulus generation or program control. NVIDIA support suggested trying an AMD card to see if that works. I’m happy to try that. But if it works then, well, you know.
Any suggestions would be warmly welcomed since I’m at the end of my knowledge and skill level with this problem.
nvidia-bug-report.log.gz (171 KB)