I'm using the FrankaDeformable task as a basis for my own OIGE simulation. I sometimes get this error:
```
fps step: 45458 fps step and policy inference: 45190 fps total: 42233 epoch: 3490/100000 frames: 457310208
fps step: 45537 fps step and policy inference: 45269 fps total: 42296 epoch: 3491/100000 frames: 457441280
fps step: 45750 fps step and policy inference: 45477 fps total: 42487 epoch: 3492/100000 frames: 457572352
2024-02-05 15:35:47 [10,896,832ms] [Error] [omni.kit.app._impl] [py stderr]:
Error executing job with overrides: ['task=UR5eTask', 'headless=True', 'num_envs=8192', 'max_iterations=100000']
Traceback (most recent call last):
  File "/home/user/omniverse-playground/OmniIsaacGymEnvs/omniisaacgymenvs/scripts/rlgames_train.py", line 142, in parse_hydra_configs
    rlg_trainer.run()
  File "/home/user/omniverse-playground/OmniIsaacGymEnvs/omniisaacgymenvs/scripts/rlgames_train.py", line 74, in run
    runner.run(
  File "/home/user/.local/share/ov/pkg/isaac_sim-2023.1.1/kit/python/lib/python3.10/site-packages/rl_games/torch_runner.py", line 133, in run
    self.run_train(args)
  File "/home/user/.local/share/ov/pkg/isaac_sim-2023.1.1/kit/python/lib/python3.10/site-packages/rl_games/torch_runner.py", line 116, in run_train
    agent.train()
  File "/home/user/.local/share/ov/pkg/isaac_sim-2023.1.1/kit/python/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 1318, in train
    step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
  File "/home/user/.local/share/ov/pkg/isaac_sim-2023.1.1/kit/python/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 1182, in train_epoch
    batch_dict = self.play_steps()
  File "/home/user/.local/share/ov/pkg/isaac_sim-2023.1.1/kit/python/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 742, in play_steps
    res_dict = self.get_action_values(self.obs)
  File "/home/user/.local/share/ov/pkg/isaac_sim-2023.1.1/kit/python/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 408, in get_action_values
    res_dict = self.model(input_dict)
  File "/home/user/.local/share/ov/pkg/isaac_sim-2023.1.1/extscache/omni.pip.torch-2_0_1-2.0.2+105.1.lx64/torch-2-0-1/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/share/ov/pkg/isaac_sim-2023.1.1/kit/python/lib/python3.10/site-packages/rl_games/algos_torch/models.py", line 278, in forward
    selected_action = distr.sample()
  File "/home/user/.local/share/ov/pkg/isaac_sim-2023.1.1/extscache/omni.pip.torch-2_0_1-2.0.2+105.1.lx64/torch-2-0-1/torch/distributions/normal.py", line 70, in sample
    return torch.normal(self.loc.expand(shape), self.scale.expand(shape))
RuntimeError: normal expects all elements of std >= 0.0

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
2024-02-05 15:35:47 [10,897,201ms] [Warning] [omni.stageupdate.plugin] Deprecated: direct use of IStageUpdate callbacks is deprecated. Use IStageUpdate::getStageUpdate instead.
2024-02-05 15:35:47 [10,897,262ms] [Warning] [carb.audio.context] 1 contexts were leaked
2024-02-05 15:35:47 [10,897,307ms] [Warning] [carb] Recursive unloadAllPlugins() detected!
2024-02-05 15:35:47 [10,897,336ms] [Warning] [omni.core.ITypeFactory] Module /home/user/.local/share/ov/pkg/isaac_sim-2023.1.1/kit/exts/omni.activity.core/bin/libomni.activity.core.plugin.so remained loaded after unload request.
There was an error running python
user@ws1:~/omniverse-playground/OmniIsaacGymEnvs/omniisaacgymenvs$
```
It seems random: sometimes the error appears after 200 training iterations, sometimes after thousands of iterations, and sometimes it never appears at all.
The error also seems to appear less often when running with fewer envs. My guess is that some robot pose triggers an invalid math operation in torch, which then causes the error, but I have no clue how to solve it.
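For what it's worth, the failure itself is easy to reproduce in isolation: the check inside torch.normal fails as soon as the std tensor contains a negative value or a NaN, so a single NaN in the sigma output of the policy network would be enough to produce exactly this message (toy values below, not taken from my task):
```
import torch

# Toy repro: any NaN or negative entry in the std tensor fails the
# std >= 0.0 check inside torch.normal.
loc = torch.zeros(3)
scale = torch.tensor([1.0, float("nan"), 1.0])
torch.normal(loc, scale)  # RuntimeError: normal expects all elements of std >= 0.0
```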
I get the same error in Isaac 2023.0.1-hotfix and Isaac 2023.1.1, and on two different machines:
Machine 1:
Kubuntu 22.04
RTX A5000, 24 GB VRAM
NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0
Machine 2:
Kubuntu 22.04
RTX 3080, 16 GB VRAM
NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3
I tried adding some checks to prevent the error, but they did not help, and the conditions I implemented are never hit:
```
def pre_physics_step(self, actions) -> None:
    if not self.world.is_playing():
        return
    reset_env_ids = self.reset_buf.nonzero(as_tuple=False).squeeze(-1)
    if len(reset_env_ids) > 0:
        self.reset_idx(reset_env_ids)
    # Actions are assumed to be in the range [-1, 1], which will be scaled to the joint limits
    self.actions = actions.clone().to(self._device)
    # Ensure no NaN or Inf values
    if torch.any(torch.isnan(self.actions)) or torch.any(torch.isinf(self.actions)):
        carb.log_error("NaN or Inf in actions tensor")
        return
```
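Since the action check never fires, I was also thinking about guarding the simulation state itself and resetting any env whose joint state goes non-finite before observations are computed. Something along these lines (self._robots is the ArticulationView from my task, so treat this as an untested sketch):
```
def post_physics_step(self):
    # Untested sketch: flag envs whose joint state blew up so they get reset
    # on the next pre_physics_step instead of feeding NaNs to the policy.
    dof_pos = self._robots.get_joint_positions(clone=False)
    dof_vel = self._robots.get_joint_velocities(clone=False)
    bad_envs = ~(torch.isfinite(dof_pos).all(dim=-1) & torch.isfinite(dof_vel).all(dim=-1))
    if bad_envs.any():
        self.reset_buf[bad_envs] = 1
    # ... then compute observations, rewards and resets as usual
```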
```
def get_observations(self) -> dict:
    [...]
    if torch.isnan(self.obs_buf).any():
        print("NaN found in obs_buf")
        print(self.obs_buf[torch.isnan(self.obs_buf)])
    if torch.isinf(self.obs_buf).any():
        print("Inf found in obs_buf")
        print(self.obs_buf[torch.isinf(self.obs_buf)])
    return observations
```
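These checks only report the problem, so another thing I may try is sanitizing the buffer right before it is returned, so that a single bad value cannot reach the policy (clip range picked arbitrarily):
```
# Sketch: replace non-finite entries and clamp extreme magnitudes
# at the end of get_observations.
self.obs_buf = torch.nan_to_num(self.obs_buf, nan=0.0, posinf=0.0, neginf=0.0)
self.obs_buf = torch.clamp(self.obs_buf, min=-10.0, max=10.0)
```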
Any ideas?
Thanks!!