Sims are no longer deterministic when multi-threading is enabled and rigidbodies overlap

I’ve recently noticed an issue with some of my PhysX simulations. If I have a group of rigidbodies placed closely together (overlapping/interpenetrating) and run a simulation on them with threading enabled (anywhere from 2 to 16 workers), the results occasionally differ between successive re-runs; overall the sim is not deterministic.

Some other info:

- I’m using the latest PhysX repo.
- All rigidbodies have the same shape (sphere collider), mass (1) and starting orientation.
- The sim has a single ground collider plane.
- CCD is not enabled.
- This is a CPU sim, not a GPU sim.
- The problem occurs no matter what the sim substeps are set to.
- The fewer the rigidbodies, the less often the problem reproduces. With 50 rigidbodies the jitter only happens on roughly every 10th re-run; with hundreds of rigidbodies in the same setup, the sim is never deterministic.
- The problem only seems to occur when the rigidbodies are interpenetrating at frame 0. The more interpenetrations there are, the easier the problem is to reproduce. I’m not sure this is the only cause, but it’s the only way I’ve been able to trigger the error.
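
In case it helps, a trimmed-down sketch of my setup looks roughly like this (illustrative only, not my actual plugin code; names and error handling stripped, teardown omitted):

```cpp
#include <PxPhysicsAPI.h>
using namespace physx;

static PxDefaultAllocator gAllocator;
static PxDefaultErrorCallback gErrorCallback;

void runSim(int numBodies, int numSteps)
{
    PxFoundation* foundation = PxCreateFoundation(PX_PHYSICS_VERSION, gAllocator, gErrorCallback);
    PxPhysics* physics = PxCreatePhysics(PX_PHYSICS_VERSION, *foundation, PxTolerancesScale());

    PxSceneDesc sceneDesc(physics->getTolerancesScale());
    sceneDesc.gravity = PxVec3(0.0f, -9.81f, 0.0f);
    sceneDesc.cpuDispatcher = PxDefaultCpuDispatcherCreate(4);   // >1 worker -> problem appears
    sceneDesc.filterShader = PxDefaultSimulationFilterShader;
    PxScene* scene = physics->createScene(sceneDesc);

    PxMaterial* material = physics->createMaterial(0.5f, 0.5f, 0.1f);

    // single ground collider plane
    PxRigidStatic* ground = PxCreatePlane(*physics, PxPlane(0.0f, 1.0f, 0.0f, 0.0f), *material);
    scene->addActor(*ground);

    // identical spheres packed closer than their diameter, so they interpenetrate at frame 0
    for (int i = 0; i < numBodies; ++i)
    {
        PxTransform pose(PxVec3((i % 10) * 0.5f, 1.0f + (i / 10) * 0.5f, 0.0f));
        PxRigidDynamic* body = PxCreateDynamic(*physics, pose, PxSphereGeometry(0.5f), *material, 1.0f);
        PxRigidBodyExt::setMassAndUpdateInertia(*body, 1.0f);    // mass 1 for every body
        scene->addActor(*body);
    }

    // fixed time-step stepping
    for (int step = 0; step < numSteps; ++step)
    {
        scene->simulate(1.0f / 60.0f);
        scene->fetchResults(true);
    }
}
```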

Is non-determinism a known limitation of multithreaded sims?

We have quite a few determinism tests, including some complex cases. It is possible you’ve found an edge case, but PhysX should be deterministic regardless of the number of threads being used.

Could you confirm how you are testing this?

Determinism requires a few things:

(1) The application itself using PhysX has to be deterministic (fixed time-steps, the exact same set of actors, actors instantiated in the same order, events occurring on the same frame, etc.)
(2) You have to destroy everything and re-create it, e.g. scenes, actors, shapes, etc. The reason is that there are internal pools that optimize allocations by recycling structures such as interactions, and that re-use can change the order in which interactions are processed.
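
In practice, each run should follow a pattern roughly like the sketch below, where the helpers are hypothetical placeholders standing in for your own setup/teardown code:

```cpp
// Hedged sketch of the "destroy and re-create everything per run" requirement.
// createSceneAndActors() and releaseActorsAndScene() are hypothetical helpers.
void runOnePass(physx::PxPhysics& physics, int numSteps, float fixedDt)
{
    physx::PxScene* scene = createSceneAndActors(physics);  // build scene, bodies and shapes in a fixed order

    for (int step = 0; step < numSteps; ++step)
    {
        scene->simulate(fixedDt);    // identical fixed time-step every frame
        scene->fetchResults(true);   // block until the step completes
    }

    releaseActorsAndScene(scene);    // release every actor/shape and then the scene, so the next
                                     // pass does not inherit recycled internal structures
}
```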

There is also PxSceneFlag::eENABLE_ENHANCED_DETERMINISM, which is required to get deterministic results for a set of bodies when simulated either individually or as part of a larger scene (where the other actors in the larger scene do not interact with the set of bodies that were part of the smaller scene). This flag disables some batching optimizations that can produce different results if constraints are batched together differently.
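
Setting it is just a scene-descriptor flag at scene creation time, e.g. (assuming an otherwise standard scene setup with `physics` being your PxPhysics instance; the relevant line is the flag):

```cpp
physx::PxSceneDesc sceneDesc(physics->getTolerancesScale());
sceneDesc.gravity = physx::PxVec3(0.0f, -9.81f, 0.0f);
sceneDesc.cpuDispatcher = physx::PxDefaultCpuDispatcherCreate(4);
sceneDesc.filterShader = physx::PxDefaultSimulationFilterShader;
sceneDesc.flags |= physx::PxSceneFlag::eENABLE_ENHANCED_DETERMINISM;  // disable the batching shortcuts mentioned above
physx::PxScene* scene = physics->createScene(sceneDesc);
```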

You mentioned that this issue only seems to occur when bodies are interpenetrating on the first frame. This may be an edge case, although I can’t think of an obvious reason why it would cause non-determinism in the simulation. Is there any chance you could provide a PVD capture of your case so I can use it to produce a repro?

Could you try enabling enhanced determinism to see if this has any influence on your results?
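
In case you haven’t wired up PVD before, the connection boils down to something like this rough sketch (assumes `using namespace physx;`, with `foundation` and `scene` being your existing objects; socket transport to a locally running PVD instance on the default port):

```cpp
PxPvd* pvd = PxCreatePvd(*foundation);
PxPvdTransport* transport = PxDefaultPvdSocketTransportCreate("127.0.0.1", 5425, 10);  // host, port, timeout (ms)
pvd->connect(*transport, PxPvdInstrumentationFlag::eALL);

// Pass the PVD instance in when creating the SDK...
PxPhysics* physics = PxCreatePhysics(PX_PHYSICS_VERSION, *foundation, PxTolerancesScale(), true, pvd);

// ...and after creating the scene, enable the per-scene transmission flags.
if (PxPvdSceneClient* pvdClient = scene->getScenePvdClient())
{
    pvdClient->setScenePvdFlag(PxPvdSceneFlag::eTRANSMIT_CONSTRAINTS, true);
    pvdClient->setScenePvdFlag(PxPvdSceneFlag::eTRANSMIT_CONTACTS, true);
    pvdClient->setScenePvdFlag(PxPvdSceneFlag::eTRANSMIT_SCENEQUERIES, true);
}
```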

Thank you for the quick response!

  1. I used fixed time steps, same actors, same order, etc.

  2. Everything does get created and destroyed properly (at least, to my knowledge based on my review of my code…nothing is re-used).

  3. Enabling enhanced determinism does not solve the problem.

I haven’t used PVD yet, nor am I sure it’s compatible with my use of PhysX (I’m making a plugin for 3ds Max, not a standalone application). I’ll look further into the capture you suggested.

I’ve also done my own custom data dump, outputting all vel/pos/tm info about my rigidbodies at each simulation step to a file. A comparison of the data shows that all values are identical between subsequent simulations, up until a point where PhysX starts returning divergent velocity values on rigidbodies. This does not occur with multithreading off.
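
For reference, the dump is essentially this per step (simplified, not my actual code; assumes `using namespace physx;`):

```cpp
#include <fstream>
#include <vector>

void dumpState(PxScene& scene, std::ofstream& out, int step)
{
    PxU32 count = scene.getNbActors(PxActorTypeFlag::eRIGID_DYNAMIC);
    std::vector<PxActor*> actors(count);
    scene.getActors(PxActorTypeFlag::eRIGID_DYNAMIC, actors.data(), count);

    for (PxU32 i = 0; i < count; ++i)
    {
        PxRigidDynamic* body = static_cast<PxRigidDynamic*>(actors[i]);
        const PxTransform tm = body->getGlobalPose();     // position + orientation
        const PxVec3 v = body->getLinearVelocity();

        out << step << ' ' << i << ' '
            << tm.p.x << ' ' << tm.p.y << ' ' << tm.p.z << ' '
            << tm.q.x << ' ' << tm.q.y << ' ' << tm.q.z << ' ' << tm.q.w << ' '
            << v.x << ' ' << v.y << ' ' << v.z << '\n';
    }
}
```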

Since this isn’t a known limitation, I’ll try digging into the PhysX code myself to see if I can figure out why this is happening…are there any places in particular you know of that might be the culprit? I’m not really familiar with how PhysX does its threading, but I am familiar with threading practices in general.

Deterministic results basically depend on the initial state being consistent (let’s assume that this is the case), on the exact same contacts being generated, and on the constraints (both contacts and joints) being processed by the solver in the same order.

Please note that even a tiny 1-bit difference in a value can snowball into visibly divergent results over time. These differences are often lost if you instrument using printf, so you might need to reinterpret float values as integers and log those to identify 1-bit differences in the bit patterns.
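
Something along these lines is enough (a memcpy-based reinterpretation avoids any rounding in the formatting):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Log the exact bit pattern of a float so even a 1-bit difference survives the round trip
// (plain %f/%g formatting can round it away).
static uint32_t floatBits(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));   // well-defined way to reinterpret the bytes
    return bits;
}

// e.g. for a velocity:
// std::printf("%08X %08X %08X\n", floatBits(v.x), floatBits(v.y), floatBits(v.z));
```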

Good point! When I get a chance I’ll re-run my logger doing exactly that and post the results. Since everything is floats, I’m definitely open to the possibility of some floating-point imprecision happening somewhere in my code.

Everything came back a-ok on my end again, and I just realized…it can’t be my code, because everything is perfectly deterministic when PhysX multithreading is off. It’s only when I set the number of workers to more than one that things become non-deterministic.

I’ll keep digging around in the physx code and report back if I find anything.

Thought I would update this thread:

The issue turned out to be user error on my part.

After a lot of debugging I finally figured out what was going on: I was modifying a hashtable in my contact filter code but forgot to put it inside a lock guard, which occasionally caused a race condition. That also explains why the inconsistent behavior only occurred when collisions were happening between rigidbodies.
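
For anyone who hits something similar, the fix boiled down to this (names made up for illustration; the filter code can be called from multiple worker threads at once):

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical, simplified version of the offending code and its fix.
std::mutex gPairMapMutex;
std::unordered_map<uint64_t, int> gPairMap;   // shared bookkeeping touched by the filter code

void onContactFilter(uint64_t pairKey)        // invoked concurrently by PhysX worker threads
{
    std::lock_guard<std::mutex> lock(gPairMapMutex);   // this guard was the missing piece
    ++gPairMap[pairKey];
}
```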