Driver issue with RT_EXCEPTIONs

Hi,
I am working on a demo renderer using the OptiX 6.5 API. I have been seeing issues with RT_EXCEPTIONs since I upgraded my driver from 455.something (I don’t remember exactly) to 460.89 on a GTX 960.

On some of my geometry I run into RT_EXCEPTION_TRACE_DEPTH_EXCEEDED exceptions for as yet unknown reasons. That was fine on the old driver: in Debug mode I printed all exceptions and wrote the exception code to my output buffer. In Release, however, I opted to ignore all exceptions via context->setExceptionEnabled(RT_EXCEPTION_ALL, false); as the visual output was still as expected.

Now with the new driver the trace call fails in Release mode with an error as soon as an exception is encountered (Debug is still fine: printing messages and setting the code to the output buffer).
The error message I get is:

Unknown error (Details: Function “_rtContextLaunch3D” caught exception: Encountered a CUDA error: cudaDriver().CuEventSynchronize( m_event ) returned (700): Illegal address, file: , line: 0)
terminate called after throwing an instance of ‘optix::Exception’
what(): Unknown error (Details: Function “bufferMap” caught exception: Encountered a CUDA error: cudaDriver().CuMemcpyDtoH( dstHost, srcDevice, byteCount ) returned (700): Illegal address, file: , line: 0)

I would like to keep ignoring exceptions in Release for performance reasons. Maybe this is an issue only for this hardware? I might be able to check this in the office against an RTX 2060 card too. The RTX 2060 and an MX150 also work fine using some older drivers, both on Win10 and Linux (versions between 440.xy and 455.xy).
Any ideas are highly appreciated. In my opinion this is a bug but maybe it is expected (new) behavior?
Regards
Toni

EDIT: I played with some legacy drivers available from the download website and found that the last version that works for me on Win10 x64 is the 452.06 WHQL driver. The next one, 456.38 WHQL, gives the above issue. From 457.71 on I start seeing another error in Release mode on a different geometry setup. I did not save the output, but it was related to alignment issues and the error number was 716, I think.

Probably there is some error on my side, but it kind of bugs me that some code runs perfectly fine up to some driver version and is then suddenly broken. I will try to check against some other hardware on Monday maybe…

Small update: I can confirm that the same issue is present on more recent hardware (RTX2060) too. I tested some driver versions that are available for this device. I did NOT see the issue with drivers 445.87 (that was already present on the system), 451.48 and 452.06 (both Studio drivers). The next version I tested was 460.89 (NSD) - here the above mentioned error occurs again.

I also tried a couple of different setups for exception handling but none gives the desired result:
a) not providing an exception program
b) turning off printing from
c) turning off only a certain exception

In any case (with newer drivers), as soon as an exception is encountered that was explicitly disabled by context->setExceptionEnabled(), the trace call terminates with the above error. I can get the exception handled quietly, but this will mess up the result for that pixel although the rendering could be fine.

So if there are any suggestions apart from “use some old driver” or “omit exceptions” feel free.

Hi @mahonyyy,

You should spend your time finding the root cause of the debug mode exception and fixing it. Disabling exceptions in a Release build for performance reasons still works, but it is not safe to do so. The reason that it was working in older drivers is luck, not design, and you cannot rely on being able to ignore legitimate exceptions. The illegal address error is telling you that your luck has run out and that ignoring the exception now crashes. There are multiple reasons this can happen when the trace depth is exceeded. The first thing to check is whether you’re calling rtTrace() recursively and allowing the trace depth to go too far.
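As a rough illustration of that first check, here is a minimal host-side C++ sketch of a depth guard. It is not OptiX device code: traceRadiance, traceShadow, closestHit, renderPixel and RayStats are hypothetical stand-ins that only mimic how a closest-hit program might count trace levels, and the toy scene assumes every radiance ray hits a reflective surface.

```cpp
#include <cassert>

// Hypothetical bookkeeping for one pixel; not part of the OptiX API.
struct RayStats {
    int traces  = 0;  // total number of simulated rtTrace() calls
    int deepest = 0;  // deepest trace level that was reached
};

void traceRadiance(int depth, int maxDepth, RayStats& stats);

// Stand-in for a shadow ray: it occupies one trace level but spawns nothing.
void traceShadow(int depth, RayStats& stats) {
    ++stats.traces;
    if (depth > stats.deepest) stats.deepest = depth;
}

// Stand-in for a closest-hit program on a mirror-like surface.
void closestHit(int depth, int maxDepth, RayStats& stats) {
    // One guard in front of every secondary ray, shadow rays included.
    if (depth >= maxDepth) return;  // terminate the path at this hit
    traceShadow(depth + 1, stats);
    traceRadiance(depth + 1, maxDepth, stats);
}

// Stand-in for rtTrace() with a radiance ray; every ray hits in this toy scene.
void traceRadiance(int depth, int maxDepth, RayStats& stats) {
    ++stats.traces;
    if (depth > stats.deepest) stats.deepest = depth;
    closestHit(depth, maxDepth, stats);
}

// Simulates one launch index: the primary ray is traced at depth 1.
RayStats renderPixel(int maxDepth) {
    assert(maxDepth >= 1);
    RayStats stats;
    traceRadiance(1, maxDepth, stats);
    return stats;
}
```

With the guard in place, renderPixel(4) never goes deeper than level 4. If the check is dropped in front of either secondary ray, the mirror-only scene recurses without bound, which is the situation the trace depth exception reports.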


David.

Hi David,
thanks for your answer.
That’s mostly what I did today, and I finally figured out that the issue was caused by ‘buggy’ geometry and by not checking the trace depth in one case in one of the closest hit programs before launching a shadow ray. Good to know that disabling exceptions may actually crash the trace call - I wasn’t really aware of that.

I wrongly assumed the illegal address was related to some failing data copy from device to host (the error message kind of indicates something like this - i.e. a copy of the exception message buffer) and not directly related to the exceeded trace depth.

So I will stay on the latest driver and fix all rendering issues directly. Disabling individual features of the shading finally pointed me to the buggy piece of code.

Regards
Toni

Hey, I’m glad you found it! Yes, generally speaking, it’s a good idea to always resolve any OptiX exceptions you see before going to release builds.

> I wrongly assumed the illegal address was related to some failing data copy from device to host (the error message kind of indicates something like this - i.e. a copy of the exception message buffer) and not directly related to the exceeded trace depth.

It’s true the error message isn’t great. Think of it as a bad pointer dereference, like accessing memory after you delete it, or dereferencing a pointer that hasn’t been initialized and has a random value. Exceeding the trace depth can cause this because you run out of stack space and can then read or write values in memory that is being used by other code. Each call to rtTrace() from within an OptiX shader program needs a stack frame, so if it’s called recursively and you run out, OptiX or your shader program can start using memory it doesn’t own, and the effect is the same as using a bad pointer.
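To make the stack frame argument concrete, here is a small host-side C++ model, offered only as a sketch. FixedStack and firstOverflowDepth are hypothetical names, and the byte sizes used with them are made-up numbers, not real OptiX frame sizes. The model reports the depth at which a fixed stack budget is exhausted; with exceptions disabled there is no such check, so the next frame would simply land in memory owned by something else.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a fixed per-thread stack; not an OptiX structure.
struct FixedStack {
    std::size_t capacity;  // bytes reserved for the stack
    std::size_t used;      // bytes currently occupied by frames

    explicit FixedStack(std::size_t bytes) : capacity(bytes), used(0) {}

    // Checked push: refuses a frame that does not fit in the remaining space.
    bool push(std::size_t frameBytes) {
        if (frameBytes > capacity - used) return false;  // would overflow
        used += frameBytes;
        return true;
    }
};

// Returns the first recursion depth whose frame no longer fits,
// or -1 if all requestedDepth frames fit into the stack.
int firstOverflowDepth(std::size_t stackBytes, std::size_t frameBytes,
                       int requestedDepth) {
    assert(frameBytes > 0);
    FixedStack stack(stackBytes);
    for (int depth = 1; depth <= requestedDepth; ++depth) {
        if (!stack.push(frameBytes)) return depth;  // this call would corrupt memory
    }
    return -1;
}
```

For example, with an assumed 1024-byte stack and 256-byte frames, four nested calls fit and the fifth is the one that would write out of bounds.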


David.

Thanks again, that makes perfect sense to me now! I guess error messages are often cryptic especially if you did not write them yourself ;-)
Regards
Toni