`gym.simulate()` hanging on `__pthread_cond_wait` in `physx::PxSyncImpl::wait(unsigned int)`

I’m trying to run some simulations using Isaac Gym, but occasionally the simulation hangs. Attaching GDB to the process, I obtain the following backtrace:

(gdb) where
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x565392ee0224) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x565392ee01d0, cond=0x565392ee01f8) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x565392ee01f8, mutex=0x565392ee01d0) at pthread_cond_wait.c:638
#3  0x00007f338e38ea77 in physx::PxSyncImpl::wait(unsigned int) ()
    from /home/nigero/third_party/IsaacGym_Preview_1_Package/isaacgym/python/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#4  0x00007f338dfbd8df in physx::NpScene::fetchResults(bool, unsigned int*) ()
   from /home/nigero/third_party/IsaacGym_Preview_1_Package/isaacgym/python/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#5  0x00007f338de9fe7f in carb::gym::GymPhysX::simulate (this=0x565393429be0, dt=<optimized out>, substeps=6)
at ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp:2946
#6  0x00007f338de5aa2d in carb::gym::GymSimulate (sim=0x565393426a20) at ../../../source/plugins/carb/gym/impl/Gym/Gym.cpp:3206
#7  0x00007f3390247bc9 in std::function<void (carb::gym::Gym&, carb::gym::GymSim*)>::operator()(carb::gym::Gym&, carb::gym::GymSim*) const (    __args#1=<optimized out>, __args#0=..., this=<optimized out>) at /usr/include/c++/7/bits/std_function.h:706
#8  pybind11::detail::argument_loader<carb::gym::Gym&, carb::gym::GymSim*>::call_impl<void, std::function<void (carb::gym::Gym&, carb::gym::GymSim*)>&, 0ul, 1ul, pybind11::detail::void_type>(std::function<void (carb::gym::Gym&, carb::gym::GymSim*)>&, pybind11::detail::index_sequence<0ul, 1ul>, pybind11::detail::void_type&&) (f=..., this=0x7ffdde85b590) at ../../../_build/target-deps/pybind11/pybind11/cast.h:1931
#9  pybind11::detail::argument_loader<carb::gym::Gym&, carb::gym::GymSim*>::call<void, pybind11::detail::void_type, std::function<void (carb::gym::Gym&, carb::gym::GymSim*)>&>(std::function<void (carb::gym::Gym&, carb::gym::GymSim*)>&) && (f=..., this=<optimized out>)    at ../../../_build/target-deps/pybind11/pybind11/cast.h:1913

Any ideas what might be happening? Am I missing any required calls? I’m happy to provide more details to help resolve this issue.

Hi @oswinso,

Are any of the standard examples working for you? Can you check that your asset loads properly, for example by swapping it in for the standard one in any of the examples? And if the asset is not the cause, can you share a code snippet that reproduces the issue?

The standard examples all work for me. In my case, I am running a modified Ant.

After more experimentation, I noticed that if I sprinkle `torch.cuda.synchronize()` calls everywhere, I get the following error after a large number of iterations:

[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 844
Traceback (most recent call last):
  File "main.py", line 319, in <module>
    main()
  File "main.py", line 311, in main
    policy, optimizer, dynamics, trajopt, config, target_state, save_path, do_plot=True
  File "main.py", line 155, in perform_run
    run_result: MPCTrainer.RunResult = runner.run()
  File "XXX", line 92, in run
    x.view(1, 1, self.dynamics.n_x), u.view(1, 1, self.dynamics.n_u), dw_mpc
  File "XXX/ant.py", line 508, in propagate_real
    self.pre_physics_step(noisy_action)
  File "XXX/ant.py", line 643, in pre_physics_step
    torch.cuda.synchronize()
  File "XXXX/lib/python3.7/site-packages/torch/cuda/__init__.py", line 380, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/TensorUtils.cpp: 155
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/TensorUtils.cpp: 155
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/TensorUtils.cpp: 155
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/TensorUtils.cpp: 155
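(A note on interpreting this: as far as I understand, CUDA errors are reported asynchronously, so the `synchronize()` that raises is only where the error surfaces, not necessarily where it originates. To try to localize it further, I also ran with kernel launches forced synchronous:)

```shell
# Force synchronous CUDA kernel launches so errors are raised at the
# offending call rather than at a later synchronize().
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```

and then launched the script as usual from the same shell.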

The specific `torch.cuda.synchronize()` that fails is the one right after I call `self.gym.set_dof_actuation_force_tensor(self.sim, forces_tensor)` in `pre_physics_step`:

def pre_physics_step(self, actions: torch.Tensor):
    # Apply actuation to the position.
    self.actions_tensor = actions.clone().to(self.device).squeeze()
    torch.cuda.synchronize()
    forces = self.actions_tensor * self._joint_gears * self._power_scale
    torch.cuda.synchronize()
    forces_tensor = gymtorch.unwrap_tensor(forces)
    torch.cuda.synchronize()
    self.gym.set_dof_actuation_force_tensor(self.sim, forces_tensor)
    torch.cuda.synchronize()  # <- Crashing after I call this here
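One thing I’m double-checking on my side: as far as I can tell, `gymtorch.unwrap_tensor(forces)` only hands the sim a raw pointer to the tensor’s storage and does not keep `forces` alive, so if the tensor ever became garbage before the sim consumed it, the sim would read freed memory. A pure-Python sketch of that hazard, with `weakref` standing in for the raw pointer (the names here are made up for illustration):

```python
import weakref

# Stand-in for a tensor whose storage a raw pointer refers to.
class Storage:
    pass

def unwrap(t):
    # Like a raw data pointer: it does NOT keep the object alive.
    return weakref.ref(t)

forces = Storage()
handle = unwrap(forces)
assert handle() is forces   # valid while a strong reference exists

del forces                  # last strong reference dropped
assert handle() is None     # the "pointer" now dangles
```

Keeping the tensor on `self` (e.g. `self.forces = ...`) until after the physics step would rule this out.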

I’m not able to share the entire codebase because it is very large and the problem is highly nondeterministic, but I’ll see if I can create a small reproducible example.

I have a similar issue.

My script stops with “Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)”

If I use the `faulthandler` module, I can trace the issue to `self.gym.simulate(self.sim)`.
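(For anyone else debugging a bare SIGSEGV: enabling `faulthandler` at the very top of the script makes Python dump its own stack when the fatal signal arrives, which is how I narrowed it down:)

```python
import faulthandler

# Dump the Python traceback to stderr on fatal signals
# (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable()

assert faulthandler.is_enabled()
```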

To give some context: I have a Franka robot, a table, a cube, and a sphere (the target). I want to reset the cube and sphere to random positions each episode. I’m doing this with the following function:

    self.cube_pos = self.np_random.uniform([-0.13, -0.3, 0.41], [0.2, 0.3, 0.41], size=(self.num_envs, 3))
    self.cube_pos[self.np_random.random(self.num_envs) < 0.5] = self.reset_hand_pos.detach().cpu().numpy()
    self.cube_pos = torch.tensor(self.cube_pos, dtype=torch.float32, device=self.device)

    self.goal_pos = torch.tensor(
        self.np_random.uniform(
            [-0.1, -0.3, 0.5], [0.13, 0.3, 0.65],
            size=(self.num_envs, 3)
        ),
        dtype=torch.float32
    ).to(self.device)

    o = gymapi.Quat.from_euler_zyx(0., 0., 0.)
    o = torch.tensor([[o.x, o.y, o.z, o.w]] * self.num_envs, dtype=torch.float32).to(self.device)

    self.root_positions[self.extra_assets_ids] = self.cube_pos
    self.root_orientations[self.extra_assets_ids] = o
    self.root_velocities[self.extra_assets_ids] = torch.zeros_like(self.root_velocities[self.extra_assets_ids])

    self.root_positions[self.targets] = self.goal_pos
    self.gym.set_actor_root_state_tensor_indexed(self.sim, self._root_tensor,
                                                 gymtorch.unwrap_tensor(
                                                     torch.tensor(self.targets+self.extra_assets_ids, dtype=torch.int32, device=self.device)),
                                                 len(self.targets+self.extra_assets_ids))

If I disable the line `self.root_positions[self.extra_assets_ids] = self.cube_pos`, there is no crash. If I set `self.cube_pos` to a constant, it also works. However, even if I use `self.goal_pos` on that line instead, it still crashes.
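(One thing worth double-checking in the snippet above: the indexed call builds its indices with `self.targets + self.extra_assets_ids`. If those are plain Python lists, `+` concatenates, which is presumably the intent; but if they are array-like (NumPy/torch), `+` adds elementwise and silently produces bogus actor indices, which could cause exactly this kind of out-of-bounds write. A quick illustration with made-up index values:)

```python
import numpy as np

targets = [3, 7]          # hypothetical actor indices
extra_assets_ids = [1, 5]

# Python lists: `+` concatenates, giving the union of indices.
assert targets + extra_assets_ids == [3, 7, 1, 5]

# Array-likes: `+` adds elementwise, silently producing wrong indices.
bad = np.array(targets) + np.array(extra_assets_ids)
assert bad.tolist() == [4, 12]
```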

I’ve just run it under gdb and got this error:

0x00007fffd6472757 in physx::Sc::Scene::unregisterInteractions(physx::PxBaseTask*) () from /home/mihai/PycharmProjects/isaac-lightning/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so

After running `where`:

#0  0x00007fffd6472757 in physx::Sc::Scene::unregisterInteractions(physx::PxBaseTask*) () from /home/mihai/PycharmProjects/isaac-lightning/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#1  0x00007fffd63a2061 in physx::Cm::Task::run() () from /home/mihai/PycharmProjects/isaac-lightning/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#2  0x00007fffd65fba3c in physx::Ext::CpuWorkerThread::execute() () from /home/mihai/PycharmProjects/isaac-lightning/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#3  0x00007fffd676ed35 in physx::(anonymous namespace)::PxThreadStart(void*) () from /home/mihai/PycharmProjects/isaac-lightning/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#4  0x00007ffff7bbb6db in start_thread (arg=0x7ffefbbfd700) at pthread_create.c:463
#5  0x00007ffff78e471f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

In my case, I found that the error happened when the cube spawned too close to the robot’s fingers. Using the aggregate functions seems to fix it. @vmakoviychuk, is there any chance someone could add a more in-depth explanation of the aggregate functions to the docs, and why they are needed?
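For anyone else hitting this, the pattern I ended up with is roughly what the shipped Franka examples do (a non-runnable sketch; the variable names and body/shape counts below are from my setup, not fixed API values):

```python
# Wrap all actor creation for one env in a single aggregate.
# The body/shape budgets must cover every actor created inside the block.
max_agg_bodies = num_franka_bodies + 3   # franka + table + cube + sphere
max_agg_shapes = num_franka_shapes + 3

self.gym.begin_aggregate(env, max_agg_bodies, max_agg_shapes, True)
franka_actor = self.gym.create_actor(env, franka_asset, franka_pose, "franka", i, 1)
table_actor  = self.gym.create_actor(env, table_asset, table_pose, "table", i, 0)
cube_actor   = self.gym.create_actor(env, cube_asset, cube_pose, "cube", i, 0)
target_actor = self.gym.create_actor(env, sphere_asset, sphere_pose, "target", i, 0)
self.gym.end_aggregate(env)
```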

Thanks!

Hello @mihai.anca13,

In the next release coming soon, we’ll provide more detailed documentation for different features, including aggregates.
