CreatePixelShader() not going asynchronous. Driver issue?

Our shader library contains about 2500 vertex and 2500 pixel shaders that are the output of fxc. We load each of them from disk and call CreatePixelShader(), in a relatively tight loop.

This is generally done in 3 seconds, and we proceed with the rest of the game load, and everybody’s happy.

But randomly, very annoyingly, this CreatePixelShader() loop can take 70s. There’s no in-between. It’s either 3s or 70s.

From what I could gather, and from what I see in VTune and Task Manager, the driver defers the optimization (compilation??) of shaders to allow games to load faster. When this happens, I see a big hump of CPU usage that only ends about 20-30s after the app has finished loading, at a point where the app itself is not using much CPU. This seems to be the driver’s background optimization pass. This is fine: I’m in the game and ready to play much sooner.

In the bad case, CreatePixelShader() runs synchronously, taking 10-500ms per call versus <0.1ms per call in the async case, fewer CPU cores are utilized during this slow 70s boot, and there is no high CPU utilization hump later. So all of the compilation must have been done synchronously.
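For anyone trying to tell the two cases apart without a profiler, here is a rough sketch of timing each call and counting the slow ones. The names CreateTimings and timedCreate are made up for this post, the callable just stands in for the real CreatePixelShader() call, and the 1ms threshold is only an assumption picked to sit between the <0.1ms and 10-500ms figures above.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Running totals for a shader-loading loop (hypothetical helper, not part of D3D).
struct CreateTimings {
    std::size_t calls = 0;      // number of timed calls
    std::size_t slowCalls = 0;  // calls that took at least the threshold
    double totalMs = 0.0;       // summed wall-clock time in milliseconds
};

// Times one call and records it in `stats`.
// `create` stands in for the real Create*Shader() call.
// `slowThresholdMs` is an assumed cutoff between the fast and slow cases.
inline void timedCreate(CreateTimings& stats,
                        const std::function<void()>& create,
                        double slowThresholdMs = 1.0) {
    const auto start = std::chrono::steady_clock::now();
    create();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;

    ++stats.calls;
    stats.totalMs += elapsed.count();
    if (elapsed.count() >= slowThresholdMs) {
        ++stats.slowCalls;
    }
}
```

If nearly every call ends up in slowCalls, the driver is most likely compiling synchronously inside the call.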

According to VTune, the CPU time in both the fast and slow cases is spent in an extremely deep recursive stack with a mix of NVAPI_Thunk and OpenAdapter10 calls inside nvwgf2umx.dll.

This happens with all driver versions, including the latest 331.40.

To reduce the likelihood of the slow load, I’ve added a 3s sleep before the shader library load, and I added a printf between each shader load. Messing with timing in this way seemed to allow the driver to defer the optimization step more often, but we still get the slow loads too often. I don’t understand what is really affecting the driver’s determination of whether to optimize synchronously or asynchronously.

Can anyone shed some light on this behavior?

Thanks a lot!

Just to follow up: The problem happened when we submitted Create*Shader() calls from multiple threads instead of all from the same thread (this was random due to our job scheduler). This causes the driver to go into synchronous shader compilation mode, and you’re supposed to handle loading the cores with shader compiles yourself. We happened to be serializing those calls, which made for the really long load times.

All this is mentioned in a GDC presentation called DX11PerformanceReloaded. http://developer.amd.com/wordpress/media/2013/04/DX11PerformanceReloaded.ppsx
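In case it helps someone, the "all from the same thread" fix from the follow-up can be sketched roughly like this: jobs from any thread are queued and executed by one dedicated thread, so the driver always sees the same caller. SingleThreadSubmitter is a made-up name, this is not our actual job scheduler code, and the queued job is assumed to wrap the real Create*Shader() call.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Runs every submitted job on one dedicated thread (hypothetical helper).
// The job is assumed to wrap a Create*Shader() call.
class SingleThreadSubmitter {
public:
    SingleThreadSubmitter() : worker_([this] { run(); }) {}

    ~SingleThreadSubmitter() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        worker_.join();  // the worker drains remaining jobs before exiting
    }

    // Safe to call from any thread; the returned future completes
    // once the job has run on the dedicated thread.
    std::future<void> submit(std::function<void()> job) {
        assert(job);
        std::packaged_task<void()> task(std::move(job));
        std::future<void> done = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(task));
        }
        wake_.notify_one();
        return done;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) {
                    return;  // stopping and nothing left to run
                }
                task = std::move(jobs_.front());
                jobs_.pop();
            }
            task();  // executed outside the lock, always on this thread
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::packaged_task<void()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};
```

Job threads would call submit() with a lambda that does the actual create call and wait on the future only if they need the shader right away. The other route, staying in synchronous mode and spreading the compiles across cores yourself, needs the opposite design: no serialization around the calls at all.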