Driver Issue?

Report:
(NOTE: the problem is solved for me for now; but maybe the solution is a speed decrease)

Since driver 431.36 (I updated from 419.67 on Jul 11th), my path tracer application (using OptiX) has frozen many times, which it never did before in this way.
The application has a heavy kernel load (I tried to implement @c_schied’s great ASVGF filter https://cg.ivd.kit.edu/atf.php https://cg.ivd.kit.edu/publications/2018/adaptive_temporal_filtering/a_svgf.zip).
There may still be some bugs in my code, but they never froze the app before.
Also, the CUDA kernels copying the output buffer to a DirectX 11 texture always worked before.

But after installing driver 431.36, the app sometimes freezes (independent of whether the denoiser is on or off).
Only ending the process through Task Manager stops it. One CPU core sits at 100%, but the GPU is at 0%.
Sometimes visual artifacts also occurred.

Then I tried using report level 3, with success: no freeze occurred!
Instead of that, I also added some logging through OutputDebugStringA; again, no freezes.

My current workaround: calling cudaDeviceSynchronize() before each kernel launch.
This may be inefficient, but the freezes have completely disappeared!
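The workaround, as a minimal sketch (the kernel and parameter names here are placeholders, not my actual code):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of the real path tracer kernels.
extern "C" __global__ void shadeKernel(float4* accum, int width, int height);

void launchWithWorkaround(float4* d_accum, int width, int height)
{
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    // Workaround: force the device to finish ALL prior work before each
    // launch. Heavy-handed (it serializes the whole pipeline), but with
    // this in place the freezes completely disappeared.
    cudaDeviceSynchronize();
    shadeKernel<<<grid, block>>>(d_accum, width, height);
}
```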

The kernels+buffers are very complex, so I cannot provide a simple reproducer.

System: OptiX 6.0.0 SDK, CUDA 10.0, GTX 1050 2 GB, Win10 Pro 64-bit (version 1809), device driver 431.36, VS2017/VS2019 (toolset v140 of VS2015)

Hi @m1,

It will be really difficult to determine the exact cause without being able to reproduce, but let’s talk through a few possibilities.

Based on what you described, I’m assuming you’re using some CUDA interop here, and possibly allowing some concurrent kernel runs, and/or buffer copies concurrently with CUDA kernels? I’m glad that cudaDeviceSynchronize() is a viable workaround for now. I don’t need to know how the kernels work, but can you describe how your kernels and buffer copies interact with each other? Were you already using cudaDeviceSynchronize() anywhere, or doing anything else to sync your CUDA stream(s) after rendering but before starting the buffer copy? If your buffer copy were to start before your kernel was done rendering, and you copied uninitialized data to the host (and/or to a DX buffer), can you imagine any reasons your code might not work properly? Are you able to investigate the timing difference before and after adding cudaDeviceSynchronize()?
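To illustrate the kind of ordering I mean, here is a rough sketch of one correctly synchronized frame (all names are hypothetical; the point is only the ordering of kernel, copy, and sync):

```cuda
#include <cuda_runtime.h>

extern "C" __global__ void renderKernel(float4* out);  // placeholder

void frame(cudaStream_t stream, float4* d_render, float4* d_copyDst,
           size_t bytes, dim3 grid, dim3 block)
{
    // 1. Render into d_render on `stream`.
    renderKernel<<<grid, block, 0, stream>>>(d_render);

    // 2. Copy the finished image. Because the copy is enqueued on the
    //    SAME stream, CUDA orders it after the kernel automatically.
    cudaMemcpyAsync(d_copyDst, d_render, bytes,
                    cudaMemcpyDeviceToDevice, stream);

    // If the copy ran on a DIFFERENT stream, an explicit dependency
    // would be required instead, e.g.:
    //   cudaEventRecord(renderDone, renderStream);
    //   cudaStreamWaitEvent(copyStream, renderDone, 0);

    // 3. Make sure everything is finished before DirectX consumes it.
    cudaStreamSynchronize(stream);
}
```

If the copy can start before the render kernel finishes (e.g. separate streams with no event between them), reading partially written data is exactly the kind of thing that can produce artifacts, and depending on the interop path, worse.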

Seeing visual artifacts would be one expected result of starting a buffer copy before the OptiX launch kernel is done rendering, but that typically won’t cause hanging, so it sounds like something more serious. The 100% CPU load while hanging might be a strong clue; you might try using a debugger to break while that’s happening and see if you can pin down which module is hanging – you won’t get OptiX symbols, but you might be able to see whether the code is hanging inside a particular DLL. Even watching it hang with Sysinternals’ procexp might offer some extra clues.

The most recent driver does have some internal changes that perhaps could be affecting you if you’re using multiple CUDA streams. I don’t know of anything in OptiX that would cause a complete hang, and we haven’t had any other reports of that, but I am taking your report seriously and I’d like to figure out a way to verify which side of the API line your hang is caused by, and start the process of getting it fixed if it’s OptiX. I can imagine a scenario where your app perhaps wasn’t synchronizing correctly but still worked in the old driver, and the new driver could be revealing a latent already-existing problem. This could also be an issue with DirectX interop. But reviewing, inspecting and verifying that your synchronization is correct could help narrow things down.

Would you be willing to draw a diagram of the inputs and outputs of each kernel and buffer copy identifying when each of them start and stop and what data each of them depends on?


David.

Hi David, thanks for your answer.

Yes, for the output pixels I use CUDA interop (mapping DirectX textures to CUDA);
cudaDeviceSynchronize() was already used before and after that interop.
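For context, the interop path follows the standard CUDA/D3D11 pattern, roughly like this (variable names are placeholders; the texture was registered once at startup with cudaGraphicsD3D11RegisterResource):

```cuda
#include <cuda_runtime.h>
#include <cuda_d3d11_interop.h>

// Sketch of the standard map / copy / unmap sequence for a registered
// D3D11 texture. `texResource` comes from cudaGraphicsD3D11RegisterResource,
// `d_output` is the device-side output buffer of the path tracer.
void copyToD3D11Texture(cudaGraphicsResource_t texResource,
                        const void* d_output, size_t pitchBytes,
                        size_t widthBytes, size_t height)
{
    cudaDeviceSynchronize();                       // rendering must be done

    cudaGraphicsMapResources(1, &texResource, 0);
    cudaArray_t texArray = nullptr;
    cudaGraphicsSubResourceGetMappedArray(&texArray, texResource, 0, 0);
    cudaMemcpy2DToArray(texArray, 0, 0, d_output, pitchBytes,
                        widthBytes, height, cudaMemcpyDeviceToDevice);
    cudaGraphicsUnmapResources(1, &texResource, 0);

    cudaDeviceSynchronize();                       // before D3D11 samples it
}
```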

I tried to compare launch times (using the exact same camera orientation), but from what I can see, the speed does not seem to decrease as I first thought. I see no artifacts in this test, but the app hangs.

I sent a diagram with a list of all kernels, all buffers, and their interactions in a private message.

m1