Training/simulation crashing, possibly due to NaN values

Isaac Sim Version

4.2.0

Isaac Lab Version (if applicable)

1.2

Operating System

Ubuntu 22.04

GPU Information

  • Model: RTX 3090
  • Driver Version: 535.183.01

Topic Description

Hi developers,

I am working on a setup where multiple drones carry an object together via strings. Each string is modeled as 7 thin links with ball joints between them, and each ball joint is modeled as 3 continuous joints about the x, y, and z axes.

Detailed Description

The links have very low mass and inertia, and I suspected early on that this could create stability issues. What I observed: under large forces and torques, the setup makes very wild movements and then disappears from the simulation, probably due to NaN values. This also caused my training to crash. My initial hypothesis was that the cable links, having very low mass and inertia, fly away when under high loads, for example when the cable is compressed. So I added several termination terms that stopped this from happening, for example limiting the angle between the drone and the links, and terminating on high velocities, angular rates, etc.

This pretty much solved the issue for a long time, until I implemented a low-level controller for the drones to go with it. The behaviour of the low-level controller seems fine and is similar to what we see in real life. Without the RL policy it does not exhibit any strange behaviour, and the actions (forces on the drone rotors) are clamped as well. Now the error has come back, and I get this error when training with SKRL:

[Error] [omni.physx.plugin] PhysX error: The application needs to increase PxGpuDynamicsMemoryConfig::foundLostAggregatePairsCapacity to -1423966208, otherwise, the simulation will miss interactions, FILE /builds/omniverse/physics/physx/source/gpubroadphase/src/PxgAABBManager.cpp, LINE 1269

And this when using RSL-RL:
File "/home/isaac-sim/.local/share/ov/pkg/isaac-sim-4.2.0/exts/omni.isaac.ml_archive/pip_prebundle/torch/distributions/normal.py", line 71, in sample
    return torch.normal(self.loc.expand(shape), self.scale.expand(shape))
RuntimeError: normal expects all elements of std >= 0.0
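In case it is relevant: the capacity in the PhysX error appears to correspond to the gpu_found_lost_aggregate_pairs_capacity field of Isaac Lab's PhysxCfg (that mapping is my reading of the docs), although the negative number makes me think the requested size is already corrupted (consistent with NaN states) rather than the capacity genuinely being too small. A minimal sketch of where it would be raised, assuming a manager-based env config; the config class and values below are placeholders for my setup:

```python
# Sketch only: raising the GPU broad-phase aggregate-pairs capacity in an
# Isaac Lab environment config. Field names follow omni.isaac.lab.sim.PhysxCfg;
# the env config class and numbers are placeholders, not a full config.
from omni.isaac.lab.envs import ManagerBasedRLEnvCfg
from omni.isaac.lab.sim import PhysxCfg, SimulationCfg
from omni.isaac.lab.utils import configclass


@configclass
class DroneCableEnvCfg(ManagerBasedRLEnvCfg):  # hypothetical env config
    sim: SimulationCfg = SimulationCfg(
        dt=1.0 / 200.0,  # smaller physics steps also help the light cable links
        physx=PhysxCfg(
            # only worth bumping if the error reports a *positive* capacity that
            # is too small; a negative value like mine points at NaN states
            gpu_found_lost_aggregate_pairs_capacity=1024 * 1024,
        ),
    )
```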

Is there any way to figure out where the crash comes from, for example by finding the last state of the drones in some log? Or do you have any idea of what it could be?
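To be concrete, what I have in mind is something like the rolling buffer below, which keeps the last few root states and dumps them the moment a non-finite value shows up. All names here are my own, not an existing Isaac Lab utility:

```python
# Sketch of a rolling "black box" recorder: keep the last N root states of the
# drones and write them to disk as soon as a NaN/Inf appears, so the state
# right before the explosion can be inspected after the crash.
from collections import deque

import torch


class StateBlackBox:
    def __init__(self, horizon: int = 50, path: str = "last_states.pt"):
        self.buffer = deque(maxlen=horizon)
        self.path = path

    def record(self, step: int, root_states: torch.Tensor) -> bool:
        """Store a copy of the states; return True if they went non-finite."""
        self.buffer.append((step, root_states.detach().cpu().clone()))
        if not torch.isfinite(root_states).all():
            torch.save(list(self.buffer), self.path)
            return True
        return False


# usage inside the step loop ("drone" is the asset name from my own scene):
# black_box = StateBlackBox()
# if black_box.record(step, env.scene["drone"].data.root_state_w):
#     raise RuntimeError("Non-finite drone state, see last_states.pt")
```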

Thanks in advance!

Jack

Additional Information

What I’ve Tried

Clamping the actions and the inputs.
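Concretely, something along these lines, applied right before the actions go to the rotors and right after the observations are assembled. The helper name and bounds are just illustrative, not an existing API:

```python
import torch


def sanitize(tensor: torch.Tensor, limit: float) -> torch.Tensor:
    """Replace NaN/Inf and clamp to a symmetric bound before use."""
    tensor = torch.nan_to_num(tensor, nan=0.0, posinf=limit, neginf=-limit)
    return tensor.clamp(-limit, limit)


# e.g. rotor-force actions and observations, bounds taken from my own setup
# actions = sanitize(actions, limit=10.0)
# observations = sanitize(observations, limit=100.0)
```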

I have faced this myself before, refer to:

Hey, thanks for replying! I haven’t tried editing the observations directly yet, so I’ll give that a shot. I see it does not solve the issue completely, so I also have some tips in case you haven’t tried these yet:

  • Increase the simulation frequency: this reduces the time between physics steps and (most likely) prevents bodies like my cable links from “exploding”. It does, however, increase computation time by quite a lot.
  • Add termination terms for the states that lead to this behaviour; for me that is constraining the angle between the drones and the load, so the cables don’t compress and take high forces.
  • Terminate when large state values are reached, such as high angular velocities (see the sketch after this list). This might hurt exploration, though, and if you actually need high velocities it makes the policy more conservative.
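For the last point, a termination term in the manager-based workflow can look roughly like the sketch below; the threshold, function name, and asset name are placeholders for your own config:

```python
# Sketch of a termination function for Isaac Lab's manager-based workflow:
# terminate an env instance when the root angular velocity blows up or the
# root state already contains non-finite values.
import torch

from omni.isaac.lab.envs import ManagerBasedRLEnv
from omni.isaac.lab.managers import SceneEntityCfg


def bad_drone_state(
    env: ManagerBasedRLEnv,
    max_ang_vel: float = 50.0,  # rad/s, placeholder threshold
    asset_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
) -> torch.Tensor:
    asset = env.scene[asset_cfg.name]
    ang_vel = torch.norm(asset.data.root_ang_vel_w, dim=-1)
    non_finite = ~torch.isfinite(asset.data.root_state_w).all(dim=-1)
    return (ang_vel > max_ang_vel) | non_finite


# registered in the TerminationsCfg of the env config, e.g.:
# bad_state = TerminationTermCfg(func=bad_drone_state, params={"max_ang_vel": 50.0})
```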

Thank you for your interest in Isaac Lab. If you still need help, to ensure efficient support and collaboration, please submit your topic to the Isaac Lab GitHub repo, following the instructions in Isaac Lab’s Contributing Guidelines on discussions, submitting issues, feature requests, and contributing to the project.

We appreciate your understanding and look forward to assisting you.