SER causes data corruption

Hi,

I am trying to add support for SER to our renderer, and while I am seeing pretty good speedups (anywhere from 20% to 100%), I am experiencing some data corruption in my per-ray data. The data block we need is pretty large (784 bytes), but from a performance point of view this still seems fine. However, after calling optixReorder some data gets corrupted. Interestingly, this does not happen immediately after the reorder call: the data returned from the closest hit shader is still fine, but a bit later in the loop some values simply flip to NaN. When I don't call optixReorder everything is fine (without recompiling any code; I just read a runtime variable to switch it on or off).
We are still on CUDA 12.8.1 (we can't upgrade since our clients can't install a driver for a newer version yet) and OptiX 9.0, and we are compiling to PTX (I also tested OptiX IR, but it shows the same issue and is just slower). The driver is 595.59 (I will test the latest update as well), running on two Blackwell Max-Q GPUs with SLI enabled for faster data transfer.
Does anyone have an idea what might cause such data corruption? Is there a way I can try to detect where exactly the bits flip?

Hi @michael.nikelsky,

The first thing I think we should validate is whether your stack size is set correctly & conservatively, as well as whether OptiX is respecting your stack size request correctly, since if the final stack size is too small for any reason, it’s possible that you could end up with stack corruption after SER that you didn’t notice when SER is off.

An easy experiment is to increase the stack size and see if the symptoms go away. Try adding some padding to the values you pass to optixPipelineSetStackSize() and let us know if that affects the symptoms.
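For readers who want to try this experiment, here is a minimal host-side sketch of padding the sizes before they are handed to optixPipelineSetStackSize(). The `StackSizes` struct and `padStackSizes` helper are hypothetical names made up for illustration and are not part of the OptiX API; the actual OptiX call is shown only as a comment because it needs the OptiX headers and a live pipeline.

```cpp
#include <cstdint>

// Hypothetical container for the three byte sizes that
// optixPipelineSetStackSize() takes besides the traversable graph depth.
struct StackSizes {
    std::uint32_t directCallableFromTraversal;
    std::uint32_t directCallableFromState;
    std::uint32_t continuation;
};

// Returns a copy of `s` with every size enlarged by `percent` percent.
inline StackSizes padStackSizes(StackSizes s, std::uint32_t percent) {
    auto pad = [percent](std::uint32_t bytes) {
        // 64-bit intermediate avoids overflow in the multiplication.
        return bytes + static_cast<std::uint32_t>((std::uint64_t(bytes) * percent) / 100u);
    };
    return { pad(s.directCallableFromTraversal),
             pad(s.directCallableFromState),
             pad(s.continuation) };
}

// Usage in host code (assumes `pipeline`, `sizes` and
// `maxTraversableGraphDepth` already exist in the application):
//   StackSizes p = padStackSizes(sizes, 100);  // 100% padding doubles every size
//   optixPipelineSetStackSize(pipeline,
//                             p.directCallableFromTraversal,
//                             p.directCallableFromState,
//                             p.continuation,
//                             maxTraversableGraphDepth);
```

If the symptoms change with the padding, that points at the stack; if they stay the same, the stack size is probably not the cause.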

You could also think about moving things around in your stack by reordering variables, changing their sizes, or adding sentinel entries or padding around your variables or your entire stack frame for detecting corruption.
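As a rough illustration of the sentinel idea, the following sketch wraps a per-ray data block in two guard words and checks them later. Everything here is an assumption for illustration: the `PerRayData` stand-in is far smaller than the real 784-byte block, and `kGuard`, `GuardedPerRayData` and `guardsIntact` are hypothetical names. It is written as plain C++ so it can be compiled on the host; in device code the helper would additionally need a `__device__` qualifier.

```cpp
#include <cstdint>

// Hypothetical guard pattern; any unlikely bit pattern works.
constexpr std::uint32_t kGuard = 0xDEADBEEFu;

// Stand-in for the real per-ray data block.
struct PerRayData {
    float radiance[3];
    float throughput[3];
};

// Places one sentinel word directly before and after the payload.
struct GuardedPerRayData {
    std::uint32_t head = kGuard;
    PerRayData    data = {};
    std::uint32_t tail = kGuard;
};

// Returns true while both sentinels still hold the expected pattern.
inline bool guardsIntact(const GuardedPerRayData& g) {
    return g.head == kGuard && g.tail == kGuard;
}
```

Calling `guardsIntact` after each stage of the loop (after the trace, after the reorder, after shading) narrows down where an out-of-bounds write happens. It only catches writes that hit the guard words, so an intact guard does not prove the payload itself is untouched.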


David.

Hi David,

I just tried doubling all the values (except the traversal graph depth) and it did not change anything. Our stack sizes are already pretty large due to some big continuation callables for subsurface scattering, volume scattering and so on, but the scene I am testing basically just contains a lot of simple materials that do not need a lot of stack space. So I don't think that is related.

I also added a bunch of checks for NaN, and it turned out that the likelihood of the data getting corrupted decreased dramatically. With the checks in place I now get about one corrupted data block every 10 to 20 frames or so, while before I was getting about 20 to 30 corrupted data blocks every frame.
Also interesting: with all the checks in place, the corrupted data was the value that was returned/written in the closest hit program.
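For anyone who wants to localize such a flip, one possible approach is to scan the whole block as floats at several points in the loop and report the first NaN slot. The sketch below is only an illustration: `firstNaNIndex` is a hypothetical helper, and it assumes the scanned region contains only float fields (integer or pointer fields can happen to hold a NaN bit pattern and would give false positives). It is plain C++ for host compilation; device code would need a `__device__` qualifier.

```cpp
#include <cmath>
#include <cstddef>
#include <cstring>

// Treats `block` as consecutive floats and returns the index of the
// first NaN, or -1 if none is found. Trailing bytes that do not fill
// a whole float are ignored.
inline long firstNaNIndex(const void* block, std::size_t bytes) {
    const unsigned char* p = static_cast<const unsigned char*>(block);
    for (std::size_t i = 0; i + sizeof(float) <= bytes; i += sizeof(float)) {
        float v;
        std::memcpy(&v, p + i, sizeof(float));  // avoids aliasing issues
        if (std::isnan(v)) {
            return static_cast<long>(i / sizeof(float));
        }
    }
    return -1;
}
```

Comparing the returned index between two consecutive check points shows which field changed and in which stage, which is more informative than a single yes/no NaN test.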

I will try to check some more places and check what happens in a debug compile tomorrow (if that works for us, we had a lot of issues with stack size exceeding the 64k limit in the past in debug mode).

Kind regards,

Michael

Just wondered about another thing: we call our trace function with a pointer to our per-ray data block, so something like traceRay(OptixTraversableHandle root, PerRayData* __restrict__ perRayData)

In the function we then take the address of this perRayData block and store it in a union:
typedef union
{
    PerRayData* m_ptr;
    uint2 m_address;
} PerRayDataPtr;

PerRayData::PerRayDataPtr prd;
prd.m_ptr = perRayData;

And finally pass this to the traversal and invoke function:
optixInvoke(prd.m_address.x, prd.m_address.y);
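As a side note on the packing itself: reading a union member other than the one last written is accepted by CUDA compilers in practice, but standard C++ leaves it undefined. A shift-based split is a common alternative. The sketch below uses hypothetical helper names (`packPointer`, `unpackPointer`) and a tiny stand-in `PerRayData`; it is plain C++ for host compilation, and it only demonstrates that the address bits survive the round trip, not anything about SER.

```cpp
#include <cstdint>

// Stand-in for the real per-ray data block.
struct PerRayData {
    float value;
};

// Splits a pointer into a low and a high 32-bit word.
inline void packPointer(void* ptr, std::uint32_t& lo, std::uint32_t& hi) {
    const std::uint64_t bits = reinterpret_cast<std::uintptr_t>(ptr);
    lo = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
    hi = static_cast<std::uint32_t>(bits >> 32);
}

// Rebuilds the pointer from the two 32-bit words.
inline void* unpackPointer(std::uint32_t lo, std::uint32_t hi) {
    const std::uint64_t bits = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return reinterpret_cast<void*>(static_cast<std::uintptr_t>(bits));
}
```

The two words would then be passed as the payload values in place of `prd.m_address.x` and `prd.m_address.y`, and unpacked again inside the hit program.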

What I wonder now is how the address of the pointer is actually affected by the reorder since, as far as I understand it, everything on the stack is migrated to another thread's local stack memory.
And if it changes, how can I get the updated address?
Interestingly, the rendered images look mostly correct except for this occasional corrupted data.

Kind regards,

Michael

OK, it turned out this was just an uninitialized variable that didn't play along. Not sure why this didn't cause issues without SER, but it works now.

Thanks
Michael


Oh whew! Thanks for reporting back. I was asking around to find out what things can go wrong, and tracking down what the compiler does with pointers to stack memory. The short story is that it should just work: pointers to stack memory should be valid across reorder calls, and it sounds like it’s working for you. I believe the optixPathTracer SDK sample currently demonstrates this too. One tip that came up, and my likely next suggestion, was to try zero-initializing everything on the stack. It might have helped here; I’m mentioning it for posterity, for others reading this later. Either way, I’m glad to hear it was an easy fix.
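For later readers, a minimal sketch of what zero-initializing a stack variable looks like in C++ (and therefore in CUDA device code as well). The `PerRayData` layout and the `makeZeroedPerRayData` helper are hypothetical stand-ins, not the struct from this thread.

```cpp
#include <cstdint>

// Stand-in for the real per-ray data block.
struct PerRayData {
    float         radiance[3];
    float         throughput[3];
    std::uint32_t depth;
    std::uint32_t seed;
};

// Returns a block whose members are all zero.
inline PerRayData makeZeroedPerRayData() {
    // Empty braces value-initialize the aggregate: every member is set to
    // zero. A plain `PerRayData prd;` would leave the members indeterminate.
    PerRayData prd = {};
    return prd;
}
```

Zeroing a 784-byte block on every ray has some cost, so this is probably most useful as a debugging step: if the symptoms disappear with it, some field is being read before it is written.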


David.