Watertightness issue with multiple iterations and validation mode off

I’m encountering a strange issue with rays going through a triangle mesh under certain circumstances. I have a set of triangles in a triangle-strip layout; they are indexed with vertices shared between adjacent triangles. The triangles in question are put into a GAS (as OPTIX_BUILD_INPUT_TYPE_TRIANGLES), and there is an IAS over all GASes. I then raytrace with one ray per pixel for a few frames.
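For reference, the triangle build input for each mesh looks roughly like this (a sketch with placeholder buffer names and counts, not my actual code):

// Sketch of the indexed-triangle build input for one mesh.
// d_vertices/d_indices are CUdeviceptr to float3 positions and uint3 index triplets.
static unsigned int geometryFlags[1] = { OPTIX_GEOMETRY_FLAG_NONE };

OptixBuildInput input = {};
input.type = OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
input.triangleArray.vertexFormat        = OPTIX_VERTEX_FORMAT_FLOAT3;
input.triangleArray.vertexStrideInBytes = sizeof(float) * 3;
input.triangleArray.numVertices         = numVertices;
input.triangleArray.vertexBuffers       = &d_vertices;      // one entry, no motion keys
input.triangleArray.indexFormat         = OPTIX_INDICES_FORMAT_UNSIGNED_INT3;
input.triangleArray.indexStrideInBytes  = sizeof(unsigned int) * 3;
input.triangleArray.numIndexTriplets    = numTriangles;
input.triangleArray.indexBuffer         = d_indices;
input.triangleArray.flags               = geometryFlags;    // must stay valid until optixAccelBuild
input.triangleArray.numSbtRecords       = 1;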

The first time this test runs, everything works as expected. A ray might occasionally slip through the geometry between components (the overall model is not watertight), but no rays go through the middle of a mesh. However, if multiple test iterations run in a single process, I start seeing rays pass through the mesh. The affected pixels are not the same each time, and only a small fraction of the pixels ever show the issue. The problem pixels produce valid results for the geometry behind the mesh (that geometry is far behind the front mesh, nowhere near close enough for Z-fighting).

Each test iteration creates a new OptiX context with all new buffers, acceleration structures, etc, and destroys them all at the end. Nothing is saved or re-used. When running with multiple GPUs, each GPU is configured separately. No acceleration structures are copied or moved.

Using OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL seems to “fix” the issue. With it enabled I’ve seen no issues even running hundreds of test iterations, whereas without it I seldom get past 5 iterations.
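For reference, I enable it at context creation, roughly like this (the log callback and cuCtx are placeholders):

// Sketch: enabling validation mode on the OptiX device context.
OptixDeviceContextOptions options = {};
options.logCallbackFunction = &contextLogCallback;   // placeholder callback
options.logCallbackLevel    = 4;
options.validationMode      = OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL; // vs. OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_OFF
OptixDeviceContext context = nullptr;
optixDeviceContextCreate(cuCtx, &options, &context); // cuCtx is the current CUDA context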

Any idea what’s going on or how OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL is changing behavior here? Is it possible/useful to set some specific bits in place of OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL to narrow things down?

I’m testing with OptiX 8.0 and CUDA 12.0, though the issue also appears with OptiX 7.5 and CUDA 11.7. I do not see the issue testing with this geometry (and camera configuration, etc) using OptiX 6.5 and CUDA 10.0, though of course there are numerous code changes between that version and the OptiX 7/8 implementation. I do not currently have a reproducer I can send.

Thanks

Windows 10 Pro 22H2
dual Quadro RTX 4000 (same issue occurs using one GPU)
536.67 driver
Visual Studio 2022
CUDA 12.0
OptiX 8.0

If the positions where this happens are not at cracks in your non-watertight mesh (I assume T-vertices), then your description sounds like either an issue inside the BVH build or traversal, or some synchronization issue.

The validation mode adds synchronization calls after each optixLaunch to catch any problem or CUDA error inside that launch, rather than at the next call after the asynchronous launch.
That means: if disabling validation mode shows the error, and adding synchronization calls around the optixLaunch in your application (and maybe also around the optixAccelBuild calls, if the error still shows) fixes it, then there is something wrong with the data flow inside your application.
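That is, something like this around each launch (CUDA_CHECK/OPTIX_CHECK are placeholder error-checking macros; pipeline, stream, sbt, etc. are placeholders as well):

// Sketch: explicit synchronization to rule out data-flow races between asynchronous calls.
CUDA_CHECK( cudaDeviceSynchronize() ); // ensure uploads and accel builds have finished
OPTIX_CHECK( optixLaunch(pipeline, stream, d_launchParams, sizeof(LaunchParams), &sbt, width, height, 1) );
CUDA_CHECK( cudaDeviceSynchronize() ); // surface errors from this launch, not from a later call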

What is your acceleration structure hierarchy? That defines the BVH traversal used.
That is, which OptixTraversableGraphFlags are you using:
OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_ANY, OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_GAS, or OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING?

If single_level or single_GAS, does the behavior change when using allow_any?
Or vice versa, if you’re using allow_any and the scene is only IAS->GAS, does the behavior change when switching to single_level?

Which OptixPrimitiveTypeFlags combination did you set in your OptixPipelineCompileOptions usesPrimitiveTypeFlags?
Only OPTIX_PRIMITIVE_TYPE_FLAGS_TRIANGLE? (The default 0 means custom primitives and triangles.)
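Both of those are fields in the OptixPipelineCompileOptions, for example (the payload/attribute counts here are just placeholders):

// Sketch: the relevant OptixPipelineCompileOptions fields.
OptixPipelineCompileOptions pco = {};
pco.traversableGraphFlags  = OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING; // or _ALLOW_ANY / _ALLOW_SINGLE_GAS
pco.usesPrimitiveTypeFlags = OPTIX_PRIMITIVE_TYPE_FLAGS_TRIANGLE;                        // 0 defaults to custom + triangle
pco.numPayloadValues       = 2;   // placeholder values
pco.numAttributeValues     = 2;
pco.pipelineLaunchParamsVariableName = "params";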

Another case where traversal wouldn’t work as expected is when using instance, geometry, or ray flags which aren’t suited to the algorithm at hand, e.g. OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT where the first hit is not necessarily the closest hit, or expecting only a single anyhit invocation without OPTIX_GEOMETRY_FLAG_REQUIRE_SINGLE_ANYHIT_CALL. But then the behavior shouldn’t change between the first and following runs of the same test.
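For reference, those flags come from two different places: the geometry flags are per SBT record in the build input, while the ray flags are per optixTrace call, roughly like this (simplified, placeholder variables):

// Host side: geometry flags per SBT record in the triangle build input.
unsigned int geometryFlags[1] = { OPTIX_GEOMETRY_FLAG_REQUIRE_SINGLE_ANYHIT_CALL };
buildInput.triangleArray.flags         = geometryFlags;
buildInput.triangleArray.numSbtRecords = 1;

// Device side: ray flags per trace; fine for visibility rays, wrong for closest-hit queries.
unsigned int p0 = 0; // payload register
optixTrace(handle, origin, direction, tmin, tmax, 0.0f, OptixVisibilityMask(255),
           OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT,
           0 /*sbtOffset*/, 1 /*sbtStride*/, 0 /*missIndex*/, p0);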

If this always works the first time, but starts failing when running the process longer, that could also be a hardware issue, like overheating.
If you have two of the same RTX 4000 boards, does this happen on either of them?

Could you track the temperatures and potential throttling with nvidia-smi while running the tests?
A command line like the following, issued in a command prompt, will print the temperatures, clocks, and throttling reasons:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,temperature.gpu,temperature.memory,power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.gr,clocks.max.gr,clocks.sm,clocks.max.sm,clocks.mem,clocks.max.mem,clocks.video,fan.speed,pstate,clocks_throttle_reasons.active --format=csv -l 1

Is this a random ray or always the same position? In the latter case, you could limit the launch to only that launch index and try to isolate the optixTrace input values leading to that result, to see if anything is different.
Or, if you’re able to detect that case, print debug information from the device code only when the closest intersection wasn’t with the expected primitive.
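For example, something like this in the ray generation program (the launch index and ray variables here are made-up placeholders):

// Sketch: print the trace inputs only for one suspicious launch index.
const uint3 idx = optixGetLaunchIndex();
if (idx.x == 123 && idx.y == 456) // hypothetical failing pixel
{
    printf("origin (%f, %f, %f) dir (%f, %f, %f) tmin %f tmax %f\n",
           origin.x, origin.y, origin.z, direction.x, direction.y, direction.z, tmin, tmax);
}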

Then there could be some driver issue where the BVH build or traversal is not working as expected.
Please try different newer display drivers as well. Usually bigger changes happen between driver branches.

If none of this helps, we would need a reproducer in failing state to investigate.

After more testing it turns out this problem can occur on the first OptiX use iteration in a given process. It’s much less common than failing on the second iteration, but it does happen occasionally.

Putting cudaDeviceSynchronize calls (looped over cudaSetDevice for the multi-GPU case) before and after every optixLaunch, optixAccelBuild, and anywhere else that seemed even vaguely relevant had no effect.

The acceleration structure hierarchy is a collection of GASes (all triangles in this test) which are inside one IAS. No other nesting, instancing, motion transforms, etc are used. traversableGraphFlags is OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING, but changing that to OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_ANY does appear to avoid the issue.

usesPrimitiveTypeFlags is OPTIX_PRIMITIVE_TYPE_FLAGS_TRIANGLE for this test. Changing that to zero had no effect.

There are no special ray or instance flags being used in this test, just OPTIX_RAY_FLAG_NONE and OPTIX_INSTANCE_FLAG_NONE.

The failure can occur with either or both of my GPUs. Temperatures seem OK (40s/50s °C).

The failure locations are random, though not uniformly so, and I don’t know which pixels will fail in advance. I started testing with an extremely narrow field of view so that all rays are nearly identical and they all hit the same triangle when it’s working correctly. This showed that the “misses” can occur in the middle of a triangle many pixels from an edge. It also made detecting incorrect results in the GPU thread easy. Printing optixTrace inputs for the failing and working rays showed no difference. Adding a retry loop (optixTrace again with the same inputs when the wrong surface got hit) “fixes” it. The second ray cast from the same GPU thread with the same inputs hits the correct triangle in every case I’ve seen.

Following on to that experiment, I removed the retry logic and added a nonsense optixTrace call (with tmin=tmax=0.0 so it never hits anything) before the real optixTrace. This also seems to be reliable in avoiding the issue.
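For reference, the two device-side workarounds look roughly like this (payload handling simplified; handle, origin, direction, tmin, tmax, and the expected-primitive check are placeholders for my actual code):

// Workaround 1: a priming trace that can never hit anything (tmin == tmax == 0.0f).
unsigned int p0 = 0, p1 = 0; // placeholder payload registers
optixTrace(handle, origin, direction, 0.0f, 0.0f, 0.0f, OptixVisibilityMask(255),
           OPTIX_RAY_FLAG_NONE, 0, 1, 0, p0, p1);

// The real trace.
optixTrace(handle, origin, direction, tmin, tmax, 0.0f, OptixVisibilityMask(255),
           OPTIX_RAY_FLAG_NONE, 0, 1, 0, p0, p1);

// Workaround 2: retry once with identical inputs when the wrong surface was hit.
if (p0 != expectedPrimitiveId) // hypothetical wrong-hit check
{
    optixTrace(handle, origin, direction, tmin, tmax, 0.0f, OptixVisibilityMask(255),
               OPTIX_RAY_FLAG_NONE, 0, 1, 0, p0, p1);
}
// Either workaround alone avoids the problem in my testing.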

I tried driver versions 537.70 (production) and 546.01 (new feature) but got the same behavior as my previous 536.67 version.

So it seems that I can avoid the issue with validation mode, allowing any traversable graph even though I only use single level instancing, or an extra optixTrace call. Do those have anything in common other than timing effects? Perhaps triggering some initialization prior to the trace traversal?

Thanks for all the tests.

Yes, the OptixTraversableGraphFlags and validation mode can affect the BVH traversal implementation.

That was all I could think of. We would need a minimal reproducer in failing state to be able to investigate what is really going on there.