Trying to port Jetbot RL RuntimeError: normal expects all elements of std >= 0.0

Hi, I’m trying to modify the OmniIsaacGymEnvs Cartpole task from section 9.2 to solve the Jetbot task of moving toward a goal object from tutorial 9.9.

I’m able to import Jetbot and start the simulation, but it seems after the first step the simulation crashes and I get the error RuntimeError: normal expects all elements of std >= 0.0.

I found this other post with suggestions about debugging this error, but the observations seem fine

E.g. here are the contents of self.obs_buf before the crash

OBSERVATIONS

tensor([[ 2.5362,  0.5362,  0.6240,  1.5362,  0.5362,  0.5362,  0.5362,  0.5362,
          0.5362,  0.4545,  0.5362,  0.5362,  0.5362,  3.1362,  0.8362,  0.5862],
        [-1.4638,  0.5362,  0.6240,  1.5362,  0.5362,  0.5362,  0.5362,  0.5362,
          0.5362,  0.4545,  0.5362,  0.5362,  0.5362, -0.8638,  0.8362,  0.5862]],
       device='cuda:0')

And I’ve set the velocities of the ArticulationView containing the Jetbots to 0:

velocities = torch.zeros((self._num_envs, 6))

self._jetbots.set_velocities(velocities)
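(One thing I also double-checked, though I’m not sure it’s related: creating that tensor on the same device as the simulation, since torch.zeros() defaults to CPU. Something like this, assuming the task exposes self._device the way the Cartpole example does:)

velocities = torch.zeros((self._num_envs, 6), device=self._device)  # 6 = linear + angular velocity components
self._jetbots.set_velocities(velocities)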

I also tried printing out arguments in various parts of the stack trace e.g.

File "C:\Users\irvin\AppData\Local\ov\pkg\isaac_sim-2023.1.1/extscache/omni.pip.torch-2_0_1-2.0.2+105.1.wx64/torch-2-0-1\torch\nn\modules\module.py", line 1504, in _call_impl
    return forward_call(*args, **kwargs)

And I am seeing NaNs, but I’m not really sure where they’re coming from or how to figure that out. Any suggestions on where to look or what to try next to get things working would be super helpful, thanks!
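For what it’s worth, here is roughly what I’ve been trying in order to find where the NaNs first show up (just a sketch; "model" stands in for the rl_games network object, which I haven’t figured out how to reach cleanly):

import torch

torch.autograd.set_detect_anomaly(True)  # makes autograd point at the op that produced a NaN

def nan_hook(module, inputs, output):
    # only handles plain tensor outputs; tuple/dict outputs would need unpacking
    if isinstance(output, torch.Tensor) and torch.isnan(output).any():
        raise RuntimeError(f"NaN in output of {module.__class__.__name__}")

for m in model.modules():  # model is a placeholder for the policy network
    m.register_forward_hook(nan_hook)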

I’m having the same exact problem in a totally different simulation. My simulation is based on the Franka Deformable one. Could not find a solution yet :(

I have opened an issue on GitHub:


Thanks for the link to the GitHub issue! I’ve had some luck by adjusting the number of environments and the minibatch size. I was trying to test with really small values, so that may have been the problem, though I’m not sure why, since I’m still new to all this and learning about neural nets, RL, etc. I’ll post an update if I learn anything new or have more consistent success.

I’m also not sure why I’m having this problem, but sometimes it does not happen. My guess is that with certain parameters, the simulation never ends up in scenarios that cause the numerical issues.

Hello all! Hopefully I can help :)

The error sounds like one I have seen from PyTorch in general:

In [1]: import torch

In [2]: torch.normal(1,-1,(5,))

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 1
----> 1 torch.normal(1,-1,(5,))

RuntimeError: normal expects std >= 0.0, but found std -1

but if you pass a tensor of elements, reporting a single offending value would be ambiguous, so the message changes:

In [3]: mu = torch.Tensor([0,0,0])

In [4]: std = torch.Tensor([-1,0,-3])

In [5]: torch.normal(mu, std)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 1
----> 1 torch.normal(mu, std)

RuntimeError: normal expects all elements of std >= 0.0
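And note (if I’m remembering the check correctly) that a NaN standard deviation trips the same message, because NaN fails the >= 0 test. That is why NaNs coming out of the network tend to surface as exactly this error:

In [6]: torch.normal(mu, torch.Tensor([1.0, float('nan'), 1.0]))

RuntimeError: normal expects all elements of std >= 0.0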

I’m guessing that somewhere in OIGE there is a very confused normal distribution wondering what the heck a negative standard deviation means, and this could be occurring for any number of reasons. I will point the team to this post :)

I did, however, find this other post with the same error

and that seems to come from a configuration failure within rl_games, which is a third party library we do not support…

There are many, many ways to train a model through reinforcement learning. Even for a single algorithm like PPO there are multiple implementations, and even between those implementations, performance is circumstantial. For these and many other reasons we are working to integrate the RL features of OIGE and other projects into Isaac-Sim! Our goal is to make it as easy as possible for a user to go from a rigged articulation to a trained policy, and this is an incredibly complex and multi-headed problem. We are working diligently to release these features and get them into your hands as quickly as possible. However, I can’t give you a timeline other than “Soon™”. Sorry!

Please look forward to it :D

It sounds like this error occurs most commonly when NaNs are passed into the policy. I would inspect the values going into the policy first.

Thanks so much for answering and looking into this @mgussert. Keep up the great work!

Can you clarify what you mean by “I would inspect the values going into the policy first”? Is this somewhere inside rl_games? I have a simulation based on the Franka Deformable and Franka Cabinet examples, and I have implemented safeguards in get_observations(), pre_physics_step() and is_done() to guarantee that there aren’t any NaNs in the tensors, but I still often get “RuntimeError: normal expects all elements of std >= 0.0”.
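For reference, the kind of safeguard I mean looks roughly like this (simplified; the helper name and where exactly I call it are specific to my task):

import torch

def sanitize(name: str, t: torch.Tensor) -> torch.Tensor:
    # called at the end of get_observations(), pre_physics_step() and is_done()
    if torch.isnan(t).any() or torch.isinf(t).any():
        print(f"non-finite values in {name}, replacing with zeros")
        t = torch.nan_to_num(t, nan=0.0, posinf=0.0, neginf=0.0)
    return t

# e.g. in get_observations():
#     self.obs_buf = sanitize("obs_buf", self.obs_buf)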

We are abandoning rl_games for these kinds of issues, among other reasons.

It could be a naming mismatch between elements of the USD / URDF scene, where the default consequence of that failure also results in this error. It could be the result of how the various distributions are managed and used. It could be the result of a simple missing exception catch within the codebase, etc…

If you are interested in doing RL with isaac-sim I would strongly recommend you check out our standalone examples that use SB3. You can find them in your installation under source/standalone_examples/api/omni.isaac.gym/. Our future iterations on RL in isaac-sim will probably use these as a springboard.

If you need any help with that, please post here!

Thanks again @mgussert for taking the time to explain.

I found only the cartpole example under source/standalone_examples/api/omni.isaac.gym/. Are there more SB3 examples/resources? Are there plans to port OIGE’s rl_games examples to SB3?

Thanks!

The plan is to expand the RL capabilities of Isaac-sim in general. I expect much of the work in OIGE will be ported over and / or integrated into isaac-sim :) I can’t say anything definitively though, if only because I don’t want to step on toes XD

Thanks for your answers @mgussert! Interesting to hear rl_games is being left behind; my understanding was that it is what allowed Isaac Sim/Gym to run the physics simulation on the GPU, which sped up training a lot. I did see Stable Baselines has vectorized environments, but it wasn’t clear whether that means you get the GPU speedup like with rl_games (the existence of this project, and some questions on Stack Overflow/GitHub issues, make me think no: GitHub - MetcalfeTom/stable-baselines3-GPU: A GPU-accelerated fork of stable-baselines. Delivering reliable implementations of reinforcement learning algorithms. Or at least it’s not standard). Do you know if the expansion of RL capabilities in Isaac Sim includes the advantage Isaac Gym has of simulating many environments in parallel?

@edsonbffilho for your question about other SB3 examples, not sure if you saw, but the Jetbot task I was trying to convert uses SB3: 9.9. Reinforcement Learning using Stable Baselines — Omniverse IsaacSim latest documentation
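From memory, the rough shape of that tutorial is below (the env class name, module path, policy type and hyperparameters here are placeholders, and details may differ between Isaac Sim versions):

from stable_baselines3 import PPO
from jetbot_env import JetBotEnv  # the gym-style env the tutorial defines around the Jetbot scene

env = JetBotEnv(headless=True)
model = PPO("MlpPolicy", env, verbose=1, device="cuda")
model.learn(total_timesteps=100000)
model.save("jetbot_policy")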

Also, in my case (trying to use rl_games), when I was looking at the stack trace it seemed to happen when running the model to get the actions (rl_games/rl_games/common/a2c_common.py at master · Denys88/rl_games · GitHub), even though the observations passed in didn’t have NaNs.

I started seeing NaNs in what I’d guess is the forward pass of the model (pytorch/torch/nn/modules/module.py at main · pytorch/pytorch · GitHub). But I didn’t manage to find or understand the code where the mean/standard deviation is generated by the neural network for A2C and then used to sample the actions, which is where I’m assuming the error is happening (i.e. the forward pass is generating a negative standard deviation, and when that is used to sample an action we see the error above… although I may be completely off on all of this…)
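In case it helps anyone else searching, here is my rough understanding of how these continuous-action policies usually produce actions (generic PPO/A2C style, not necessarily rl_games’ exact code):

import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    # generic Gaussian policy head: the network outputs the action mean, and a learned
    # log-std parameter is exponentiated to get the (normally positive) std
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ELU(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        mu = self.body(obs)                     # NaN observations or NaN weights make mu NaN
        std = self.log_std.exp().expand_as(mu)  # exp() can't go negative, but exp(NaN) is NaN
        # torch.normal raises "normal expects all elements of std >= 0.0" when std
        # contains NaN (NaN fails the >= 0 check), e.g. after log_std received NaN gradients
        return torch.normal(mu, std)

actor = GaussianActor(obs_dim=16, act_dim=2)
print(actor(torch.randn(2, 16)).shape)  # torch.Size([2, 2])

So possibly the std isn’t going negative so much as going NaN somewhere upstream, though again I may be off.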


Hello irvinh!

The goal is to save all of the good and none of the bad in this integration. The creation of performant GPU-based vectorized training environments is a core requirement of RL that we will absolutely support. I am confident in this. While I don’t have the clout or the authority to make any sort of official guarantee like “We will release this feature with these specs at this time”, I don’t see a path forward without GPU-accelerated vectorized training environments. I also feel confident in communicating that this perspective is shared by the powers that be (the people in control of actually designing this integration).

There are many reasons why we are moving away from rl_games, but the biggest one is that we don’t want to produce RL tools that are tied to specific third-party libraries. Rather, we seek to create software that motivates the creation of new RL libraries and furthers GPU integration in those that already exist. The major hurdle along this path is managing and exposing the appropriate data on the GPU, which is complicated enough as it is. It gets even trickier when the notion of what a “GPU” is expands to mean “a DGX-based data center”.

USD provides us with a universal data interchange format for representing arbitrary 3D scenes, but this generality comes at a cost to performance. We get around this through Fabric and USDRT, which “mirror” the data on the stage; after all, any data on the stage is coming from the GPU anyway, so this problem reduces to one of associating the appropriate addresses on the device. Much of this code is very “low level”, meaning that it involves handling the actual gearing that lets the whole thing run in the first place. Ideally, users wanting to use the simulation for reinforcement learning shouldn’t need deep, intimate knowledge of how the data is managed on the GPU.

Creating a tool set for defining these GPU-based vectorized environments therefore means not just exposing Fabric and USDRT functionality to the user, but also answering a whole slew of questions through design. How do you manage an arbitrary subset of your environments needing to be reset when they are distributed across multiple DGX machines? What are all the different use cases / RL algorithms available, and do we satisfy those use cases at a minimum? How do we handle synchronicity and asynchronicity? How deeply should a user need to be concerned with the “vectorized” nature of the environment? Etc.

It’s a huge and complicated problem, but it’s also exciting :D


Very cool to hear and looking forward to it! Thanks again for taking the time to answer these questions, sounds very exciting indeed.
