Let me start with the conclusion, then I’ll describe how I got there.
Somewhere in nvwgf2umx.dll, there is code that maintains a pool of threads. In an attempt to improve performance, these threads are executing calls to Sleep(0). The idea is probably that this “yields” the processor to other (presumably) more important threads.
The problem is, Sleep is being called so frequently that rather than helping, it actually drags performance down. After all, Sleep is a kernel call, so each call incurs a user mode -> kernel mode transition (and back). When pressed, nvwgf2umx is making tens of thousands of these calls per second, which is a significant load all by itself.
Now, how did I come to this conclusion?
I was working with some video playback code, looking for leaks. Playing video at normal speeds meant that waiting for playback of a few thousand frames took (what felt like) forever. By adjusting the timestamps/durations of the video samples, I cut that time down drastically.
Having started down that road, I wondered if it was possible to trim that down even more. The normal method of showing video on Windows runs through a component known as the EVR (Enhanced Video Renderer). Properly configured, the EVR can use hardware capabilities for video decoding via DX9. Doing this brought playback time down to about 6.7 seconds for a video that normally runs for more than 2 minutes.
Not being satisfied, I wondered if I could use a later version of DirectX. Presumably later versions would use newer hardware features and might be even faster.
Unfortunately, the EVR does not support anything beyond DX9. However, MS did provide a sample showing how a DX11 EVR might work. I tried that, and indeed it improved things further, bringing playback time down to 5.4 seconds.
Using DX11 also nearly doubled the amount of CPU time being used, and virtually all of the increase was "kernel" time. This seemed odd, since the decoding and display work was all being done "in hardware." Why would DX11 need so much more CPU (as opposed to GPU) time than DX9? Firing up my profiling tools, I soon zeroed in on nvwgf2umx.dll as the culprit, with Sleep(0) being the call.
Knowing that this probably wouldn’t be considered compelling evidence for anyone but me, I kept experimenting. Eventually I ran into D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS. If I add this flag to the D3D11CreateDevice call, my CPU time suddenly drops back to DX9 levels, without any corresponding loss in performance.
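For anyone who wants to reproduce this, the change is essentially one flag at device creation time. This is a minimal Windows-only sketch of the call, not my exact code; error handling is omitted, and a real video playback path would likely pass additional flags (e.g. D3D11_CREATE_DEVICE_VIDEO_SUPPORT) depending on the setup:

```cpp
#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")

// Create the device with the driver's internal threading optimizations
// disabled. In my tests, this flag alone dropped CPU usage back to DX9
// levels with no corresponding loss in playback performance.
ID3D11Device*        device  = nullptr;
ID3D11DeviceContext* context = nullptr;
D3D_FEATURE_LEVEL    level;

HRESULT hr = D3D11CreateDevice(
    nullptr,                    // default adapter
    D3D_DRIVER_TYPE_HARDWARE,
    nullptr,
    D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS,
    nullptr, 0,                 // default feature levels
    D3D11_SDK_VERSION,
    &device, &level, &context);
```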
Tempting as it might be to call that a solution, the docs pretty clearly state “This flag is not recommended for general use.” Depending on a debug flag doesn’t seem like much of a solution.
My hope is that someone can do a review of the nvwgf2umx code and come up with a better approach. I don’t know what purpose the Sleep actually serves there: whether it solved a reported problem, or just seemed like a good idea. But it’s clearly not optimal in all circumstances.