While testing my OptiX 7.5-based path tracer application (see app design details),
I resolved several “Undefined Behavior” situations.
Finding the root cause always took much more time than expected. Validation Mode messages helped a lot, but in one specific case there was no crash, no validation error, and no exception. Instead, on a random render frame, processing kept looping inside nvcuda64.dll; often it simply did not return from a cudaDeviceSynchronize call or an optixLaunch call (see screenshots).
I still have not been able to identify the “real” root cause of that, although I fixed a global logical error, which means that .cu kernel is simply no longer used for that case. (The kernel was not required there, because it would have simulated an invisible “virtual” geometry; that geometry is now not defined at all, which removes the GAS and IAS entry for it.)
For now this problem is solved for me, so my question here is more about what the underlying technical reason for such a situation could have been.
It's clear that a kernel that is not designed properly may cause problems, but because there was no error message and no validation mode message, I had to search for a very long time to find the root cause.
The kernel in question (a closesthit program) is designed to handle multiple cases. Somehow the driver seems to repeat something internally, which then caused the frozen app state.
I cannot provide a minimal reproducer for this, because it happens in a complex app and I simply don’t know what exactly went wrong in that kernel. The kernel works very well for all the other cases.
Generally, rendering proceeds without problems as long as the geometry update for that subset is not done. But if it is done (on every final frame), then after a random frame (often 5, sometimes 27, or somewhere in between, or later) the renderer suddenly hangs at the position shown in the screenshot.
The geometry update (updating vertex buffers and rebuilding GAS + IAS) works without problems in all other cases. The geometry in the failing case was a custom primitive, defined as a sphere using a custom intersection program (not the new built-in sphere primitive!). That geometry works fine in all other cases when used with exactly that closesthit .cu kernel. So it's clear that it was an implementation problem, but it's unclear what the wrong code led to, because if invalid input data were the reason, I would normally expect a crash or invalid visual output rather than a hang.
It also seems not to be a memory issue.
Could someone from NVIDIA tell what kind of checks are going on in the driver at the address shown in the screenshots, where address = (RIP register) - (module base address of nvcuda64.dll)?
So what is the driver attempting to do there?
In VS2019, debugging can be paused and resumed again and again, and as you can see, processing is then somewhere else in the driver each time. The call stack does not change completely, though: only the latest entries change, and the stack remains unchanged beyond loaded address 0007ffaefef3d94h in the DEBUG build case. After subtracting the module’s base address 0007ffaefc40000h, the RVA in the .DLL image is 2b3d94h.
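To make the address arithmetic above explicit, here is a minimal C++ sketch of the subtraction. The helper names `rva_of` and `format_rva` are hypothetical and exist only for illustration; the two constants are the loaded address and the nvcuda64.dll base address quoted above for the DEBUG build.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: turns a loaded (virtual) address into an RVA by
// subtracting the base address at which the module was mapped.
constexpr std::uint64_t rva_of(std::uint64_t loaded_address,
                               std::uint64_t module_base) {
    return loaded_address - module_base;
}

// Hypothetical helper: formats an RVA in the "<hex>h" notation used in this post.
std::string format_rva(std::uint64_t rva) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%llxh",
                  static_cast<unsigned long long>(rva));
    return buf;
}

// Values taken from the call stack in the DEBUG build case.
constexpr std::uint64_t kLoadedAddress = 0x7ffaefef3d94ULL;  // stack entry
constexpr std::uint64_t kModuleBase    = 0x7ffaefc40000ULL;  // nvcuda64.dll base

// Compile-time check that the subtraction matches the RVA stated in the post.
static_assert(rva_of(kLoadedAddress, kModuleBase) == 0x2b3d94ULL,
              "RVA should match the value computed by hand");
```

Calling `format_rva(rva_of(kLoadedAddress, kModuleBase))` gives the same offset in the post's notation. Because the module base can differ between runs, only the RVA is a stable way to refer to a location inside the .DLL image.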
I had the same problem on an earlier driver, before version 516.59.
I could update to the newest driver and check again, but updating did not help in the past.
OptiX 7.5.0 SDK
GTX 1050 2GB
Win10PRO 64bit (version 21H1; build 19043.1237)
device driver: 516.59
MDL SDK 2020.1.2
Windows SDK 10.0.19041.0