CreatePixelShader() not going asynchronous. Driver issue?

steel_3d · October 18, 2013, 7:01pm

Our shader library contains about 2500 vertex and 2500 pixel shaders that are the output of fxc. We load each of them from disk and call CreatePixelShader(), in a relatively tight loop.

This is generally done in 3 seconds, and we proceed with the rest of the game load, and everybody’s happy.

But randomly, very annoyingly, this CreatePixelShader() loop can take 70s. There’s no in-between. It’s either 3s or 70s.

From what I could gather, and from what I see in vtune and task manager, the driver defers the optimization (compilation??) of shaders to allow games to load faster. When this happens, I see a big hump of cpu usage that only ends about 20-30s after the app’s done loading and not using many CPU resources itself. This seems to be the driver’s background optimization pass. This is fine, I’m in the game and ready to play much quicker.

In the bad case, CreatePixelShader() runs synchronously, taking 10-500ms per call versus <0.1ms per call in the async case. Fewer CPU cores are utilized during this slow 70s boot, and there is no high CPU utilization hump later. So all of the compilation must have been done synchronously.

According to vtune, the CPU resources used in both the fast and slow cases are from an extremely deep recursive stack with a mix of NVAPI_Thunk and OpenAdapter10 calls inside nvwgf2umx.dll.

This happens with all driver versions, including the latest 331.40.

To reduce the likelihood of the slow load, I’ve added a 3s sleep before the shader library load, and I added a printf between each shader load. Messing with timing in this way seemed to allow the driver to defer the optimization step more often, but we still get the slow loads too often. I don’t understand what is really affecting the driver’s determination of whether to optimize synchronously or asynchronously.

Can anyone shed some light on this behavior?

Thanks a lot!

steel_3d · October 23, 2013, 1:06am

Just to follow up: The problem happened when we submitted Create*Shader() calls from multiple threads instead of all from the same thread (this was random due to our job scheduler). This causes the driver to go into synchronous shader compilation mode, where you’re expected to spread the shader compiles across the cores yourself. We happened to be serializing those calls, so the compiles ran one at a time, which made for the really long load times.
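One way to keep every Create*Shader() call on the same thread, whatever a job scheduler does, is to funnel the calls through a single dedicated worker. The sketch below is an assumption about how that could look in standard C++, not our engine code: `ShaderCreateThread` and its methods are hypothetical names, and each submitted job stands in for one real Create*Shader() call on the device.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: runs every submitted job on one dedicated thread,
// so all creation calls originate from the same thread.
class ShaderCreateThread {
public:
    ShaderCreateThread() : worker_([this] { run(); }) {}

    // Drains any queued jobs, then stops the worker.
    ~ShaderCreateThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    ShaderCreateThread(const ShaderCreateThread&) = delete;
    ShaderCreateThread& operator=(const ShaderCreateThread&) = delete;

    // Safe to call from any thread; the job itself runs on the worker.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

    std::thread::id thread_id() const { return worker_.get_id(); }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // shutdown requested, queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so submit() never blocks on a job
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist before run() starts
};
```

Scheduler jobs would then call `submit()` with a lambda wrapping the creation call instead of calling the device directly. The other route, going by the description above, is to accept the synchronous mode and issue the calls from several threads genuinely in parallel rather than serializing them.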

All this is mentioned in a GDC presentation called DX11PerformanceReloaded. http://developer.amd.com/wordpress/media/2013/04/DX11PerformanceReloaded.ppsx
