Context crashing and unable to recover

I have a path-tracer-based program that crashes after a certain number of frames.

I’m getting this error:

Unknown error (Details: Function "_rtContextLaunch2D" caught exception:
Encountered a CUDA error: cudaDriver().CuEventSynchronize( m_event ) returned (700): Illegal address)

I figured out that this error is thrown from the launch call:

context->launch(0, camera.width(), camera.height());

So I wrapped it in a try-catch block so that it doesn’t crash the whole program.
Next, if I try to either launch the context again or call “context->destroy();” in order to recreate the context and run it again, I get the following error:

Unknown error (Details: Function "_rtVariableSet1ui" caught exception:
Assertion failed: "!m_launching", file: <internal>, line: 211)

So what I understand from this error is that because the context launch crashed, the “m_launching” flag hasn’t been turned off, and any attempt to reference either the context or the GLFW window results in an immediate crash.

I’ve looked nearly everywhere online and in the documentation and haven’t found any mention of the “m_launching” variable.

How can I resolve the issue or at least manage my exceptions properly?

Using OptiX 6.0.0

What I’ve found is that these types of crashes stay fixed once you’ve dealt with them. That is, once I found and fixed the underlying issue, it has so far never crept up again. So if you fixed it in dev, chances are it’s also fixed in prod.

However, depending on the severity, I need to restart either the app, the dev environment, or (very rarely) the whole box.

I will probably be working on a cloud solution soon, so the way I plan to go about this right now is to have different services checking on each other and restarting as necessary.
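
A minimal sketch of the "restart as necessary" idea, assuming a POSIX system and that the renderer runs as a separate child process. The names runOnce and supervise are hypothetical helpers for illustration, not part of OptiX or any library. A child that is killed by a signal (for example after an abort) is reported as -1 and treated like any other failure.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <string>
#include <vector>

// Runs `argv` as a child process and waits for it to finish.
// Returns the child's exit code (127 if exec failed), or -1 if fork/waitpid
// failed or the child was terminated by a signal.
int runOnce(const std::vector<std::string>& argv) {
    if (argv.empty()) return -1;

    // execvp expects a null-terminated array of char*.
    std::vector<char*> cargv;
    for (const std::string& a : argv) cargv.push_back(const_cast<char*>(a.c_str()));
    cargv.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) return -1;              // fork failed
    if (pid == 0) {                      // child process
        execvp(cargv[0], cargv.data());
        _exit(127);                      // only reached if exec failed
    }

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;   // retry only when interrupted
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

// Starts the command and restarts it after every non-zero exit, up to
// `maxRestarts` restarts. Returns how many times the command was started.
int supervise(const std::vector<std::string>& argv, int maxRestarts) {
    int starts = 0;
    while (starts <= maxRestarts) {
        ++starts;
        int code = runOnce(argv);
        if (code == 0) break;            // clean exit, nothing to restart
        std::fprintf(stderr, "child exited with %d (start %d)\n", code, starts);
    }
    return starts;
}
```

With this, something like supervise({"./renderer"}, 5) would start a hypothetical ./renderer binary and restart it up to five times if it keeps failing. Restarting the process is what matters here, because a fresh process gets fresh CUDA state.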

I have encountered quite a few similar issues during development.

“Unknown error” and “returned (700): Illegal address” usually mean that a memory access is out of range. This can have many causes, such as reading or writing a buffer out of bounds in user code, or possibly in OptiX internal code.

When it happens, I usually try to narrow down the problem to find exactly which code or modification OptiX doesn’t like. Most cases turn out to be user code issues.

I’m having the same issue. Surely there must be a way to destroy a context that failed to launch and make a new one. I can’t find anything in the documentation on how to do this. It seems like once a launch fails for any reason, I have to restart the whole application to make it work again.

If this is with OptiX versions before 7.0.0, then restarting the app is probably the only solution.
You would need to be able to destroy the underlying CUDA device contexts because they are probably in an erroneous state after that failure. But that is not an option inside the OptiX API before 7.0.0.
OptiX 1 - 6 use the existing primary CUDA context on the device, like the CUDA runtime API does, to allow OptiX/CUDA interoperability. That context doesn’t get destroyed with the OptiX context because it’s owned by the application process.

In OptiX 7 all CUDA context and resource management is explicit and fully under your control.
This means you can create and destroy CUDA contexts per device at will if needed (at least with the CUDA Driver API).
It’s the more modern OptiX API: explicit, multithreading-safe, always faster, and more future-proof. Porting older OptiX applications to it is highly recommended.