Training/simulation crashing, possibly due to NaN values

Isaac Sim Version

4.2.0

Isaac Lab Version (if applicable)

1.2

Operating System

Ubuntu 22.04

GPU Information

  • Model: RTX 3090
  • Driver Version: 535.183.01

Topic Description

Hi developers,

I am working on a setup where multiple drones carry an object together via strings. Each string is modeled as 7 thin links with ball joints between them, and each ball joint is modeled as 3 continuous joints about the x, y, and z axes.

Detailed Description

The links have very low mass and inertia, and I suspected early on that this could create stability issues. What I observed: under large forces and torques, the setup makes very wild movements and then disappears from the simulation, probably due to NaN values. This also caused my training to crash. My initial hypothesis was that the cable links, having very low mass and inertia, fly away when under high loads, for example when the cable is compressed. So I added several termination terms that stopped this from happening, for example limiting the angle between the drone and the links, and terminating on high velocities, angular rates, etc.

This pretty much solved the issue for a long time, until I implemented a low-level controller for the drones to go with it. The behaviour of the low-level controller seems fine and is similar to what we see in real life. Without the RL policy it does not exhibit any strange behaviour, and the actions (forces on the drone rotors) are clamped as well. Now the error has come back, and I get this error when training with SKRL:

[Error] [omni.physx.plugin] PhysX error: The application needs to increase PxGpuDynamicsMemoryConfig::foundLostAggregatePairsCapacity to -1423966208, otherwise, the simulation will miss interactions, FILE /builds/omniverse/physics/physx/source/gpubroadphase/src/PxgAABBManager.cpp, LINE 1269

And this when using RSL-RL:
File "/home/isaac-sim/.local/share/ov/pkg/isaac-sim-4.2.0/exts/omni.isaac.ml_archive/pip_prebundle/torch/distributions/normal.py", line 71, in sample
    return torch.normal(self.loc.expand(shape), self.scale.expand(shape))
RuntimeError: normal expects all elements of std >= 0.0
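In case it is relevant: the capacity in the PhysX error appears to correspond to the gpu_found_lost_aggregate_pairs_capacity field of Isaac Lab's PhysxCfg (that mapping is my reading of the docs), although the negative number makes me think the requested size is already corrupted (consistent with NaN states) rather than the capacity genuinely being too small. A minimal sketch of where it would be raised, assuming a manager-based env config; the config class and values below are placeholders for my setup:

```python
# Sketch only: raising the GPU broad-phase aggregate-pairs capacity in an
# Isaac Lab environment config. Field names follow omni.isaac.lab.sim.PhysxCfg;
# the env config class and numbers are placeholders, not a full config.
from omni.isaac.lab.envs import ManagerBasedRLEnvCfg
from omni.isaac.lab.sim import PhysxCfg, SimulationCfg
from omni.isaac.lab.utils import configclass


@configclass
class DroneCableEnvCfg(ManagerBasedRLEnvCfg):  # hypothetical env config
    sim: SimulationCfg = SimulationCfg(
        dt=1.0 / 200.0,  # smaller physics steps also help the light cable links
        physx=PhysxCfg(
            # only worth bumping if the error reports a *positive* capacity that
            # is too small; a negative value like mine points at NaN states
            gpu_found_lost_aggregate_pairs_capacity=1024 * 1024,
        ),
    )
```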

Is there any way to figure out where the crash comes from, for example by finding the last state of the drones in some log? Or do you have any idea of what it could be?
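To be concrete, what I have in mind is something like the rolling buffer below, which keeps the last few root states and dumps them the moment a non-finite value shows up. All names here are my own, not an existing Isaac Lab utility:

```python
# Sketch of a rolling "black box" recorder: keep the last N root states of the
# drones and write them to disk as soon as a NaN/Inf appears, so the state
# right before the explosion can be inspected after the crash.
from collections import deque

import torch


class StateBlackBox:
    def __init__(self, horizon: int = 50, path: str = "last_states.pt"):
        self.buffer = deque(maxlen=horizon)
        self.path = path

    def record(self, step: int, root_states: torch.Tensor) -> bool:
        """Store a copy of the states; return True if they went non-finite."""
        self.buffer.append((step, root_states.detach().cpu().clone()))
        if not torch.isfinite(root_states).all():
            torch.save(list(self.buffer), self.path)
            return True
        return False


# usage inside the step loop ("drone" is the asset name from my own scene):
# black_box = StateBlackBox()
# if black_box.record(step, env.scene["drone"].data.root_state_w):
#     raise RuntimeError("Non-finite drone state, see last_states.pt")
```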

Thanks in advance!

Jack

Additional Information

What I’ve Tried

Clamping the actions and the inputs.
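Concretely, something along these lines, applied right before the actions go to the rotors and right after the observations are assembled. The helper name and bounds are just illustrative, not an existing API:

```python
import torch


def sanitize(tensor: torch.Tensor, limit: float) -> torch.Tensor:
    """Replace NaN/Inf and clamp to a symmetric bound before use."""
    tensor = torch.nan_to_num(tensor, nan=0.0, posinf=limit, neginf=-limit)
    return tensor.clamp(-limit, limit)


# e.g. rotor-force actions and observations, bounds taken from my own setup
# actions = sanitize(actions, limit=10.0)
# observations = sanitize(observations, limit=100.0)
```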

I have faced this myself before, refer to:

Hey, thanks for replying! I haven’t tried editing the observations directly yet, so I’ll give that a shot. I see it does not solve the issue completely, so I also have some tips in case you haven’t tried these yet:

  • Increase the simulation frequency: this reduces the time between physics steps and (most likely) prevents bodies like my cable links from “exploding”. It does, however, increase computation time by quite a lot.
  • Add termination terms for the states that lead to this behaviour; for me that is constraining the angle between the drones and the load, so the cables don’t compress and take high forces.
  • Terminate when large state values are reached, such as high angular velocities (see the sketch after this list). This might hurt exploration, though, and if you actually need high velocities it makes the policy more conservative.
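For the last point, a termination term in the manager-based workflow can look roughly like the sketch below; the threshold, function name, and asset name are placeholders for your own config:

```python
# Sketch of a termination function for Isaac Lab's manager-based workflow:
# terminate an env instance when the root angular velocity blows up or the
# root state already contains non-finite values.
import torch

from omni.isaac.lab.envs import ManagerBasedRLEnv
from omni.isaac.lab.managers import SceneEntityCfg


def bad_drone_state(
    env: ManagerBasedRLEnv,
    max_ang_vel: float = 50.0,  # rad/s, placeholder threshold
    asset_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
) -> torch.Tensor:
    asset = env.scene[asset_cfg.name]
    ang_vel = torch.norm(asset.data.root_ang_vel_w, dim=-1)
    non_finite = ~torch.isfinite(asset.data.root_state_w).all(dim=-1)
    return (ang_vel > max_ang_vel) | non_finite


# registered in the TerminationsCfg of the env config, e.g.:
# bad_state = TerminationTermCfg(func=bad_drone_state, params={"max_ang_vel": 50.0})
```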

Thank you for your interest in Isaac Lab. If you still need help, to ensure efficient support and collaboration, please submit your topic to the Isaac Lab GitHub repo, following the instructions in Isaac Lab’s Contributing Guidelines on discussions, submitting issues, feature requests, and contributing to the project.

We appreciate your understanding and look forward to assisting you.