I am stuck. I have an application that uses OptiX (currently on 6.0.0) to render some pictures. Sometimes a launch fails for one reason or another. Maybe I ran out of GPU memory, or maybe I wrote a bug in the kernel, etc.
The point is, I want to be able to recover from rare failures that I haven’t seen yet by deleting my context and rebuilding it. But it seems like once a launch fails, there is no recovering. Any attempt to destroy the context or anything attached to it throws an exception.
How can I destroy my context??? I figure this should be possible.
I just answered that in the other thread:
(EDIT: Copying here as well to make this thread more consistent)
If this is with earlier OptiX versions before 7.0.0, then restarting the app is probably the only solution.
You would need to be able to destroy the underlying CUDA device contexts because they are probably in an erroneous state after that failure. But that is not an option inside the OptiX API before 7.0.0.
OptiX 1 through 6 use the existing primary CUDA context on the device, like the CUDA Runtime API does, to allow OptiX/CUDA interoperability. That CUDA context doesn’t get destroyed with the OptiX context because it’s owned by the application process.
In OptiX 7 all CUDA context and resource management is explicit and fully under your control.
That means you can create and destroy CUDA contexts per device at will if needed (at least with the CUDA Driver API).
It’s the more modern OptiX API, explicit, multithreading safe, always faster, and more future proof. It’s highly recommended to port older OptiX applications over to that.
Potential steps you could try:
- Switch to OptiX 6.5.0 and see if your launch failures persist.
That should mostly be a recompile of your application with the newer SDK and an update of the display drivers when required.
- If yes, try analyzing if there is anything wrong inside your code which could be responsible for these errors.
- If you still need to be able to shut down the CUDA context, you might try creating one yourself before you create the OptiX context.
OptiX versions before 7.0.0 will latch onto an existing primary CUDA context.
The CUDA Runtime API manages CUDA contexts more automatically than the CUDA Driver API, which is explicit.
The former will create contexts automatically for all visible devices on the first CUDA call, which is why you normally trigger that with a dummy cudaFree(0) call. The latter lets you do it explicitly with cuCtxCreate() per device.
Look for cudaDeviceReset() to see how to destroy that here:
and for cuCtxDestroy() here:
- In the long term, you should port to the new OptiX 7 API, where all this CUDA context handling is explicit. If you need the CUDA context to be reset or destroyed, you can do that because it is fully under your control.
My OptiX 7 examples show the difference between CUDA Runtime and CUDA Driver API usage: https://github.com/NVIDIA/OptiX_Apps
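To make the "destroy and rebuild on failure" idea from the steps above concrete, here is a minimal C++ sketch of the retry loop only. It is not OptiX or CUDA code: `RenderHooks` and `launch_with_recovery` are hypothetical names made up for illustration, and the three callbacks are placeholders you would fill with your own setup, teardown, and launch code (for example cudaFree(0) and cudaDeviceReset() with the Runtime API, or cuCtxCreate() and cuCtxDestroy() with the Driver API). It assumes a failed launch surfaces as a C++ exception and that a context already exists before the first attempt. Whether the teardown actually succeeds on OptiX versions before 7.0.0 is not guaranteed, so errors during teardown are swallowed here.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical hooks for illustration; none of these names exist in OptiX or CUDA.
struct RenderHooks {
    std::function<void()> create_context;   // placeholder: build the CUDA/OptiX state
    std::function<void()> destroy_context;  // placeholder: tear the CUDA/OptiX state down
    std::function<void()> launch;           // placeholder: throws std::exception on failure
};

// Runs launch() up to max_attempts times. After each failed attempt the context
// is destroyed and recreated. Returns the attempt number that succeeded, or
// rethrows the last launch failure once the attempts are used up.
inline int launch_with_recovery(const RenderHooks& hooks, int max_attempts) {
    assert(max_attempts > 0);
    for (int attempt = 1;; ++attempt) {
        try {
            hooks.launch();
            return attempt;
        } catch (const std::exception&) {
            if (attempt == max_attempts) {
                throw;  // out of attempts: report the launch failure to the caller
            }
            // Teardown may itself fail when the context is in an error state,
            // so ignore errors here and still try to rebuild.
            try {
                hooks.destroy_context();
            } catch (...) {
            }
            hooks.create_context();
        }
    }
}
```

If rebuilding fails as well, the exception from create_context propagates to the caller, which is the point where restarting the process remains the fallback.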
In any case, this should never actually happen when everything is working correctly. That means there is something wrong either inside the application or inside one of the involved NVIDIA driver modules, and the latter can only be fixed if you file bug reports.
I appreciate the response. I think you’re right that I will have to port to OptiX 7 at some point and that will help with this sort of thing. I was able to find the reason my launch was crashing and failing, but I’m just looking for a more general way to recover. I have a long-running backend rendering service and I want it to be as stable as possible. I’d like to say I will never make a mistake in the CUDA code, but inevitably there will be bugs now and then.
Thanks for the clear answers.