Context crashing and unable to recover

I have a path-tracer-based program that crashes after a certain number of frames.

I’m getting this error:

Unknown error (Details: Function "_rtContextLaunch2D" caught exception:
Encountered a CUDA error: cudaDriver().CuEventSynchronize( m_event ) returned (700): Illegal address)

I figured out that this error is thrown from the launch call:

context->launch(0, camera.width(), camera.height());

So I wrapped it in a try/catch block so that it doesn’t crash the whole program.
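
A minimal sketch of that kind of wrapper, for illustration only. `guardedLaunch` and `LaunchResult` are hypothetical names, and the launch is passed in as a plain callable so the snippet compiles without the OptiX headers. It also assumes the exception thrown by the launch derives from `std::exception`.

```cpp
#include <exception>
#include <functional>
#include <string>

// Result of one guarded launch attempt (hypothetical helper type).
struct LaunchResult {
    bool ok;              // true if the launch returned normally
    std::string message;  // what() text of the caught exception, empty on success
};

// Runs `launch` and converts any thrown exception into a LaunchResult.
// `launch` is a stand-in for a lambda such as
//   [&] { context->launch(0, camera.width(), camera.height()); }
LaunchResult guardedLaunch(const std::function<void()>& launch) {
    try {
        launch();
        return {true, ""};
    } catch (const std::exception& e) {
        // Assumption: the OptiX exception type derives from std::exception.
        return {false, e.what()};
    } catch (...) {
        return {false, "unknown non-standard exception"};
    }
}
```

The caller can then check `ok` and log `message` instead of letting the exception propagate. This only keeps the exception from unwinding the program; it does not by itself make the context usable again.
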
Next, if I try to either launch the context again or call “context->destroy();” in order to recreate the context and run it again, I get the following error:

Unknown error (Details: Function "_rtVariableSet1ui" caught exception:
Assertion failed: "!m_launching", file: <internal>, line: 211)

So what I understand from this error is that because the context launch crashed, the “m_launching” flag was never cleared, and any attempt to use either the context or the GLFW window results in an immediate crash.

I’ve looked nearly everywhere online and in the documentation and haven’t found any mention of the “m_launching” variable.

How can I resolve the issue or at least manage my exceptions properly?

Using OptiX 6.0.0

What I’ve found is that these types of crashes stay fixed once you have dealt with them. That is, once I found and fixed the cause, it has so far never crept back in. So if you fixed it in dev, chances are it’s also fixed in prod.

However, depending on the severity, I need to restart either the app, the dev environment, or (very rarely) the whole box.

I will probably be working on a cloud solution soon, so the way I plan to go about this right now is to have different services checking on each other and restarting as necessary.
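
As a rough sketch of the restart idea, assuming the renderer runs as a separate worker whose exit status can be observed: `supervise` and `SuperviseResult` are hypothetical names, and `runWorker` stands in for "start the worker process and wait for its exit code", which a real service would do with the platform's process API.

```cpp
#include <functional>

// Outcome of supervising a worker (hypothetical helper type).
struct SuperviseResult {
    int restarts;    // how many times the worker was restarted
    bool succeeded;  // true if some run exited with status 0
};

// Runs `runWorker` until it returns 0 or `maxRestarts` restarts are used up.
// `runWorker` is a stand-in for spawning the renderer and waiting for it.
SuperviseResult supervise(const std::function<int()>& runWorker, int maxRestarts) {
    int restarts = 0;
    while (true) {
        if (runWorker() == 0) {
            return {restarts, true};   // clean exit, nothing more to do
        }
        if (restarts == maxRestarts) {
            return {restarts, false};  // give up and report the failure
        }
        ++restarts;                    // crash or error exit: start it again
    }
}
```

Capping the number of restarts keeps a deterministic crash from turning into an endless restart loop.
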

I have encountered quite a few similar issues during development.

“Unknown error” together with “returned (700): Illegal address” usually means that a memory access was out of range. That can have many causes, such as user code reading or writing past the end of a buffer, or possibly a problem inside OptiX itself.
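
To illustrate the kind of guard that catches an out-of-range write, here is a CPU-side sketch. `checkedWrite` is a hypothetical helper and the row-major `width * height` layout is an assumption. In device code the same comparison against the buffer size would go before the access.

```cpp
#include <cstddef>
#include <vector>

// Writes `value` at (x, y) of a row-major width * height buffer.
// Returns false instead of touching memory when the access would be out of range.
bool checkedWrite(std::vector<float>& buffer, std::size_t width, std::size_t height,
                  std::size_t x, std::size_t y, float value) {
    if (x >= width || y >= height) {
        return false;  // index lies outside the image
    }
    if (buffer.size() < width * height) {
        return false;  // buffer is smaller than the dimensions claim
    }
    buffer[y * width + x] = value;
    return true;
}
```

A typical trigger is a launch size that no longer matches the buffer size, for example after a window resize, so logging the rejected indices can point straight at the mismatch.
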

When it happens, I usually try to narrow the problem down to find exactly which piece of code or modification OptiX doesn’t like. In most cases it turns out to be a user code issue.
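
One systematic way to do that narrowing is a bisection over an ordered list of changes. The sketch below is not specific to OptiX. `firstBadChange` is a hypothetical name, and `crashesWithFirst(k)` stands in for "build and run with only the first k changes applied and report whether it crashes". It assumes the version with no changes works, the version with all changes crashes, and the crash is reproducible.

```cpp
#include <functional>

// Returns the smallest k in [1, count] for which crashesWithFirst(k) is true.
// Assumes crashesWithFirst(0) is false, crashesWithFirst(count) is true,
// and that once a change introduces the crash, all later states also crash.
int firstBadChange(int count, const std::function<bool(int)>& crashesWithFirst) {
    int good = 0;      // no crash with the first `good` changes applied
    int bad = count;   // crash with the first `bad` changes applied
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (crashesWithFirst(mid)) {
            bad = mid;   // the culprit is at or before `mid`
        } else {
            good = mid;  // the culprit is after `mid`
        }
    }
    return bad;
}
```

With N changes this needs about log2(N) test runs instead of N, which matters when each run has to render many frames before the crash shows up.
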