`gym.simulate()` hanging on `__pthread_cond_wait` in `physx::PxSyncImpl::wait(unsigned int)`

I’m trying to run some simulations using Isaac Gym, but occasionally the simulation hangs. Attaching GDB to the process, I obtain the following backtrace:

(gdb) where
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x565392ee0224) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x565392ee01d0, cond=0x565392ee01f8) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x565392ee01f8, mutex=0x565392ee01d0) at pthread_cond_wait.c:638
#3  0x00007f338e38ea77 in physx::PxSyncImpl::wait(unsigned int) ()
    from /home/nigero/third_party/IsaacGym_Preview_1_Package/isaacgym/python/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#4  0x00007f338dfbd8df in physx::NpScene::fetchResults(bool, unsigned int*) ()
   from /home/nigero/third_party/IsaacGym_Preview_1_Package/isaacgym/python/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#5  0x00007f338de9fe7f in carb::gym::GymPhysX::simulate (this=0x565393429be0, dt=<optimized out>, substeps=6)
at ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp:2946
#6  0x00007f338de5aa2d in carb::gym::GymSimulate (sim=0x565393426a20) at ../../../source/plugins/carb/gym/impl/Gym/Gym.cpp:3206
#7  0x00007f3390247bc9 in std::function<void (carb::gym::Gym&, carb::gym::GymSim*)>::operator()(carb::gym::Gym&, carb::gym::GymSim*) const (    __args#1=<optimized out>, __args#0=..., this=<optimized out>) at /usr/include/c++/7/bits/std_function.h:706
#8  pybind11::detail::argument_loader<carb::gym::Gym&, carb::gym::GymSim*>::call_impl<void, std::function<void (carb::gym::Gym&, carb::gym::GymSim*)>&, 0ul, 1ul, pybind11::detail::void_type>(std::function<void (carb::gym::Gym&, carb::gym::GymSim*)>&, pybind11::detail::index_sequence<0ul, 1ul>, pybind11::detail::void_type&&) (f=..., this=0x7ffdde85b590) at ../../../_build/target-deps/pybind11/pybind11/cast.h:1931
#9  pybind11::detail::argument_loader<carb::gym::Gym&, carb::gym::GymSim*>::call<void, pybind11::detail::void_type, std::function<void (carb::gym::Gym&, carb::gym::GymSim*)>&>(std::function<void (carb::gym::Gym&, carb::gym::GymSim*)>&) && (f=..., this=<optimized out>)    at ../../../_build/target-deps/pybind11/pybind11/cast.h:1913

Any ideas what might be happening? Am I missing any required calls? I’m happy to provide more details to help resolve this issue.

Hi @oswinso,

Are any of the standard examples working for you? Can you check that your asset loads properly, for example by swapping it in for the standard one in any of the examples? And if the asset is not the cause, can you share a code snippet that reproduces the issue?

The standard examples all work for me. In my case, I am running a modified Ant.

After more experimentation, I noticed that if I sprinkle `torch.cuda.synchronize()` calls everywhere, I get the following error after a large number of iterations:

[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 844
Traceback (most recent call last):
  File "main.py", line 319, in <module>
    main()
  File "main.py", line 311, in main
    policy, optimizer, dynamics, trajopt, config, target_state, save_path, do_plot=True
  File "main.py", line 155, in perform_run
    run_result: MPCTrainer.RunResult = runner.run()
  File "XXX", line 92, in run
    x.view(1, 1, self.dynamics.n_x), u.view(1, 1, self.dynamics.n_u), dw_mpc
  File "XXX/ant.py", line 508, in propagate_real
    self.pre_physics_step(noisy_action)
  File "XXX/ant.py", line 643, in pre_physics_step
    torch.cuda.synchronize()
  File "XXXX/lib/python3.7/site-packages/torch/cuda/__init__.py", line 380, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/TensorUtils.cpp: 155
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/TensorUtils.cpp: 155
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/TensorUtils.cpp: 155
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/TensorUtils.cpp: 155
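(A note on interpreting this: as far as I understand, CUDA errors are reported asynchronously, so the `synchronize()` that raises is only where the error surfaces, not necessarily where it originates. To try to localize it further, I also ran with kernel launches forced synchronous:)

```shell
# Force synchronous CUDA kernel launches so errors are raised at the
# offending call rather than at a later synchronize().
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```

and then launched the script as usual from the same shell.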

The specific `torch.cuda.synchronize()` that fails is the one right after I call `self.gym.set_dof_actuation_force_tensor(self.sim, forces_tensor)` in `pre_physics_step`:

def pre_physics_step(self, actions: torch.Tensor):
    # Apply actuation to the position.
    self.actions_tensor = actions.clone().to(self.device).squeeze()
    torch.cuda.synchronize()
    forces = self.actions_tensor * self._joint_gears * self._power_scale
    torch.cuda.synchronize()
    forces_tensor = gymtorch.unwrap_tensor(forces)
    torch.cuda.synchronize()
    self.gym.set_dof_actuation_force_tensor(self.sim, forces_tensor)
    torch.cuda.synchronize()  # <- Crashing after I call this here
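One thing I’m double-checking on my side: as far as I can tell, `gymtorch.unwrap_tensor(forces)` only hands the sim a raw pointer to the tensor’s storage and does not keep `forces` alive, so if the tensor ever became garbage before the sim consumed it, the sim would read freed memory. A pure-Python sketch of that hazard, with `weakref` standing in for the raw pointer (the names here are made up for illustration):

```python
import weakref

# Stand-in for a tensor whose storage a raw pointer refers to.
class Storage:
    pass

def unwrap(t):
    # Like a raw data pointer: it does NOT keep the object alive.
    return weakref.ref(t)

forces = Storage()
handle = unwrap(forces)
assert handle() is forces   # valid while a strong reference exists

del forces                  # last strong reference dropped
assert handle() is None     # the "pointer" now dangles
```

Keeping the tensor on `self` (e.g. `self.forces = ...`) until after the physics step would rule this out.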

I’m not able to share the entire codebase because it is very large and the problem is highly nondeterministic, but I’ll see if I can create a small reproducible example.

I have a similar issue.

My script stops with “Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)”

If I use the `faulthandler` module, I can trace the issue to `self.gym.simulate(self.sim)`.
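(For anyone else debugging a bare SIGSEGV: enabling `faulthandler` at the very top of the script makes Python dump its own stack when the fatal signal arrives, which is how I narrowed it down:)

```python
import faulthandler

# Dump the Python traceback to stderr on fatal signals
# (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable()

assert faulthandler.is_enabled()
```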

To give some context: I have a Franka robot, a table, a cube, and a sphere (the target). I want to reset the cube and sphere to random positions each episode. I’m doing this with the following function:

    self.cube_pos = self.np_random.uniform([-0.13, -0.3, 0.41], [0.2, 0.3, 0.41], size=(self.num_envs, 3))
    self.cube_pos[self.np_random.random(self.num_envs) < 0.5] = self.reset_hand_pos.detach().cpu().numpy()
    self.cube_pos = torch.tensor(self.cube_pos, dtype=torch.float32, device=self.device)

    self.goal_pos = torch.tensor(
        self.np_random.uniform(
            [-0.1, -0.3, 0.5], [0.13, 0.3, 0.65],
            size=(self.num_envs, 3)
        ),
        dtype=torch.float32
    ).to(self.device)

    o = gymapi.Quat.from_euler_zyx(0., 0., 0.)
    o = torch.tensor([[o.x, o.y, o.z, o.w]] * self.num_envs, dtype=torch.float32).to(self.device)

    self.root_positions[self.extra_assets_ids] = self.cube_pos
    self.root_orientations[self.extra_assets_ids] = o
    self.root_velocities[self.extra_assets_ids] = torch.zeros_like(self.root_velocities[self.extra_assets_ids])

    self.root_positions[self.targets] = self.goal_pos
    self.gym.set_actor_root_state_tensor_indexed(self.sim, self._root_tensor,
                                                 gymtorch.unwrap_tensor(
                                                     torch.tensor(self.targets+self.extra_assets_ids, dtype=torch.int32, device=self.device)),
                                                 len(self.targets+self.extra_assets_ids))

If I disable the line `self.root_positions[self.extra_assets_ids] = self.cube_pos`, there is no crash. If I set `self.cube_pos` to a constant, it also works. However, even if I use `self.goal_pos` on that line instead, it still crashes.
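(One thing worth double-checking in the snippet above: the indexed call builds its indices with `self.targets + self.extra_assets_ids`. If those are plain Python lists, `+` concatenates, which is presumably the intent; but if they are array-like (NumPy/torch), `+` adds elementwise and silently produces bogus actor indices, which could cause exactly this kind of out-of-bounds write. A quick illustration with made-up index values:)

```python
import numpy as np

targets = [3, 7]          # hypothetical actor indices
extra_assets_ids = [1, 5]

# Python lists: `+` concatenates, giving the union of indices.
assert targets + extra_assets_ids == [3, 7, 1, 5]

# Array-likes: `+` adds elementwise, silently producing wrong indices.
bad = np.array(targets) + np.array(extra_assets_ids)
assert bad.tolist() == [4, 12]
```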

I’ve just run it under gdb and got this error:

0x00007fffd6472757 in physx::Sc::Scene::unregisterInteractions(physx::PxBaseTask*) () from /home/mihai/PycharmProjects/isaac-lightning/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so

After running `where`:

#0  0x00007fffd6472757 in physx::Sc::Scene::unregisterInteractions(physx::PxBaseTask*) () from /home/mihai/PycharmProjects/isaac-lightning/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#1  0x00007fffd63a2061 in physx::Cm::Task::run() () from /home/mihai/PycharmProjects/isaac-lightning/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#2  0x00007fffd65fba3c in physx::Ext::CpuWorkerThread::execute() () from /home/mihai/PycharmProjects/isaac-lightning/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#3  0x00007fffd676ed35 in physx::(anonymous namespace)::PxThreadStart(void*) () from /home/mihai/PycharmProjects/isaac-lightning/isaacgym/_bindings/linux-x86_64/libcarb.gym.plugin.so
#4  0x00007ffff7bbb6db in start_thread (arg=0x7ffefbbfd700) at pthread_create.c:463
#5  0x00007ffff78e471f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

In my case, I found that the error happened when the cube spawned too close to the robot’s fingers. Using the aggregate functions seems to fix it. @vmakoviychuk, is there any chance someone could add a more in-depth explanation of the aggregate functions to the docs, and why they are needed?
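For anyone else hitting this, the pattern I ended up with is roughly what the shipped Franka examples do (a non-runnable sketch; the variable names and body/shape counts below are from my setup, not fixed API values):

```python
# Wrap all actor creation for one env in a single aggregate.
# The body/shape budgets must cover every actor created inside the block.
max_agg_bodies = num_franka_bodies + 3   # franka + table + cube + sphere
max_agg_shapes = num_franka_shapes + 3

self.gym.begin_aggregate(env, max_agg_bodies, max_agg_shapes, True)
franka_actor = self.gym.create_actor(env, franka_asset, franka_pose, "franka", i, 1)
table_actor  = self.gym.create_actor(env, table_asset, table_pose, "table", i, 0)
cube_actor   = self.gym.create_actor(env, cube_asset, cube_pose, "cube", i, 0)
target_actor = self.gym.create_actor(env, sphere_asset, sphere_pose, "target", i, 0)
self.gym.end_aggregate(env)
```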

Thanks!

Hello @mihai.anca13,

In the next release coming soon, we’ll provide more detailed documentation for different features, including aggregates.
