Crashes Since Driver 301.42

Dear OptiX Team,

I have been using OptiX on Windows since 2011 with legacy code built on top of it. However, with NVIDIA display drivers newer than 301.42, I experience random TDR crashes.

I tested the code on different Windows versions (including 7, 8, and server versions) and with different hardware configurations. I also tested on the same hardware and platform with different versions of OptiX (including the latest) and CUDA. It still crashes.

When I disabled TDR on Windows, the program blocked indefinitely instead of crashing the driver. I checked GPU memory for a leak and found none.

I cannot reproduce the crashes on demand; they just happen randomly. They occur during ray tracing (rtTrace), usually after assigning diffuse/specular material to objects in the scene.

I know it is very difficult to find the problem without sample code from me; however, I don’t know which portion of the large legacy code base causes this error.

Please note that as soon as I downgrade to display driver 301.42, there are no more crashes on any of the platforms or hardware that I tested. But this is not sustainable: newer hardware no longer supports 301.42.

So my question is: is this a known bug? Has anybody experienced this issue in the past?

What could be the problem? And what would you recommend I try?

So, sometimes it doesn’t work, but you can’t reproduce the problem or give out the code. This is going to be a tough one.

First off, is there anything stochastic in the code? If there is, can you eliminate it temporarily in order to make the code completely deterministic? This will help create a reproducible result.
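
As a rough illustration of what "completely deterministic" can look like, here is a minimal C++ sketch that seeds a per-pixel random number generator from the pixel index and frame number instead of from the clock. The names make_pixel_seed and Lcg are hypothetical and do not come from your code or from OptiX; the same idea would have to be carried over to wherever your device code seeds its generator.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: derive a seed only from values that are identical on
// every run (pixel index and frame number), never from time or addresses.
inline std::uint32_t make_pixel_seed(std::uint32_t pixel_index,
                                     std::uint32_t frame_number) {
    std::uint32_t h = pixel_index * 9781u + frame_number * 6271u + 1u;
    // Integer mixing steps so that neighbouring pixels get unrelated seeds.
    h ^= h >> 16;
    h *= 0x7feb352du;
    h ^= h >> 15;
    h *= 0x846ca68bu;
    h ^= h >> 16;
    return h;
}

// Minimal linear congruential generator; an illustrative stand-in for
// whatever generator the real code uses.
struct Lcg {
    std::uint32_t state;

    // Returns a value in [0, 1) and advances the state.
    float next() {
        state = state * 1664525u + 1013904223u;
        float u = static_cast<float>(state >> 8) * (1.0f / 16777216.0f);
        assert(u >= 0.0f && u < 1.0f);
        return u;
    }
};
```

With this kind of seeding, two runs that render the same pixel of the same frame draw the same sequence of samples, so a failing frame can be replayed.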

If there’s absolutely nothing stochastic in the code (i.e., no random number generators) and the error still happens only occasionally, you may have an uninitialized variable or buffer somewhere in your code. When that happens, your input ends up being whatever was in memory before your program ran, which can be arbitrary garbage, including NaN. I’m not sure why the older driver version would change this, but I can’t rule it out.
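
One way to test the uninitialized-buffer theory is to fill every buffer with a known value before first use and to scan the contents for NaN or infinity before launching. The C++ sketch below works on a plain float array standing in for a mapped buffer; fill_buffer and count_non_finite are hypothetical helper names, and hooking them into your buffer map/unmap code is left as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: overwrite every element with a known value so that no
// element depends on whatever was in memory before the program started.
inline void fill_buffer(float* data, std::size_t count, float value = 0.0f) {
    assert(data != nullptr || count == 0);
    std::fill(data, data + count, value);
}

// Hypothetical helper: count the elements that are NaN or +/-infinity.
inline std::size_t count_non_finite(const float* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    std::size_t bad = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (!std::isfinite(data[i])) {
            ++bad;
        }
    }
    return bad;
}
```

If the count is ever nonzero for a buffer you believed was fully written, that buffer is a good place to start looking.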

Second, since your program blocks indefinitely when you disable TDR, you should check for possible infinite loops. Make sure for and while loops are guaranteed to terminate eventually. Also, look for loops that might be caused by casting a new ray as a result of a previous ray intersection. I had a bug once where, due to floating point imprecision, some rays that reflected from surfaces far away from the origin would be given starting points behind the surfaces that reflected them. This resulted in an infinite loop of rays hitting and spawning from the same surface. Perhaps you have a similar issue.
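
Two common guards against that kind of runaway ray spawning are an origin offset that scales with the size of the coordinates and a hard cap on recursion depth. The C++ sketch below shows both on plain structs; Vec3, offset_ray_origin, may_spawn_secondary_ray, and the base epsilon of 1e-4 are all illustrative assumptions, not part of OptiX or of your code. In a real OptiX program the minimum ray distance (tmin) passed when constructing the ray plays a similar role to the offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Hypothetical helper: push the origin of a secondary ray off the surface,
// toward the side the outgoing ray leaves from. The offset grows with the
// magnitude of the hit point, because a fixed tiny offset is rounded away
// when coordinates are large.
inline Vec3 offset_ray_origin(Vec3 hit_point, Vec3 unit_normal, Vec3 out_dir,
                              float base_epsilon = 1e-4f) {
    float magnitude = std::max({1.0f, std::fabs(hit_point.x),
                                std::fabs(hit_point.y), std::fabs(hit_point.z)});
    float side = dot(unit_normal, out_dir) < 0.0f ? -1.0f : 1.0f;
    return hit_point + unit_normal * (side * base_epsilon * magnitude);
}

// Hypothetical helper: refuse to spawn another ray once the depth carried in
// the ray payload reaches the cap, so the chain always terminates.
inline bool may_spawn_secondary_ray(int depth, int max_depth) {
    assert(max_depth >= 0);
    return depth < max_depth;
}
```

For a sense of scale: near a coordinate of 1e6, adjacent single-precision floats are 0.0625 apart, so adding a fixed 1e-4 there leaves the origin exactly on the surface. The depth cap is the more important of the two for your hang, since it guarantees termination even if the offset is still wrong.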

There’s a good chance you’ll have to debug your legacy code to find the error. If you can, start by commenting out large portions, and narrow down your search as you eliminate code segments that don’t seem to cause the error. Using rtPrintf to learn what code is being called may also be helpful.

If nothing else works, you can try setting the environment variable OPTIX_API_CAPTURE=./my_trace to save a trace of a failed execution that might be useful to the OptiX team.

Good luck!