Segmentation fault (core dumped) when running gym.simulate(sim)

I am trying to create an environment with an articulated object, and I plan to train a policy to interact with such objects. For now, the environment contains a half-open drawer, and the robot tries to open/close it. I created 2048 environments in total in the simulator, but when I forward actions to the robot, I get a segmentation fault after several steps. I looked into the code and found that the segfault happens inside the gym.simulate(sim) call. How should I debug this error? (I am using PhysX as my physics engine and simulating everything on the GPU, and I already have a working Vulkan driver. Interestingly, if I reduce the number of environments, e.g. train with only a single environment, the segfault goes away. Likewise, if I place the robot far away from the object so that it never touches it, the segfault is also gone, so this seems related to contact simulation.)

Hi @lichothu.

I had similar issues when working on my environment. I believe it comes from either calling gym.simulate while there are nan values inside the simulator, or from resetting objects with the wrong syntax.

In my case, the nan values appeared when objects were colliding on spawn and then being shot into space.
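If you want to rule that out, a quick sanity check on the state tensors right before each simulate call can catch it early. A minimal sketch (the tensor you pass is whichever gymtorch-wrapped state you want to inspect; the threshold is an assumption you should tune):

```python
import torch

def has_exploded(state: torch.Tensor, limit: float = 1e6) -> bool:
    # nan/inf, or absurdly large values, usually mean an object spawned
    # in collision and was shot into space during the first steps
    return bool(torch.isnan(state).any()
                or torch.isinf(state).any()
                or state.abs().max() > limit)
```

Calling this on the dof state and root state tensors before gym.simulate lets you fail with a clean assertion instead of a segfault.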

By syntax I mean the variables you use to reset the dof positions, but also the index variable, which must be properly instantiated. It would be a bit easier to help if I could see some of your code.

I would suggest setting up a dummy script that just resets your environment continuously. You can then track which line before the simulate() call causes the trouble.
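Something like this (a sketch; `reset_fn` and `simulate_fn` are placeholders for your own `task.reset(...)` and `gym.simulate(sim)` calls):

```python
def stress_test_reset(reset_fn, simulate_fn, n_iters=1000):
    # Reset continuously; because the prints are flushed, the last line
    # on stdout when the process segfaults tells you which call died.
    for step in range(n_iters):
        print(f"[{step}] reset", flush=True)
        reset_fn()
        print(f"[{step}] simulate", flush=True)
        simulate_fn()
```

For example, `stress_test_reset(lambda: task.reset(all_env_ids), lambda: gym.simulate(sim))`; commenting out lines inside your reset then bisects the culprit.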

Hope this helps,
Mihai

Hi @mihai.anca13

Thanks for the reply. I double-checked whether I have nan values during the reset, and every value looks fine. I noticed that the segfault happens after several resets of the environment, and that it does not occur if the gripper is far away from the articulated object. So maybe I did something wrong in the reset method. Here is the code for reset:

def reset(self, env_ids):
    print("------------------------------------reset")
    self.task_state = -1

    # reset object
    # reset object dof
    self.object_dof_state[env_ids, :, 1] = torch.zeros_like(self.object_dof_state[env_ids, :, 1])
    self.object_dof_state[env_ids, :, 0] = ((to_torch(self.object_dof_upper_limits, device=self.device)
                                             + to_torch(self.object_dof_lower_limits, device=self.device))
                                            * 0.5).repeat((self.num_envs, 1))

    # reset franka
    pos = tensor_clamp(self.franka_default_dof_pos.unsqueeze(0), self.franka_dof_lower_limits, self.franka_dof_upper_limits)

    self.franka_dof_pos[env_ids, :] = pos
    self.franka_dof_vel[env_ids, :] = torch.zeros_like(self.franka_dof_vel[env_ids])
    self.franka_dof_targets[env_ids, :self.num_franka_dofs] = pos
    self.root_state_tensor[self.franka_actor_idxs[env_ids]] = self.valid_init_state[env_ids].clone()

    # reset franka actor
    franka_indices = self.franka_actor_idxs[env_ids].to(torch.int32)
    self.gym.set_actor_root_state_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.root_state_tensor),
                                                 gymtorch.unwrap_tensor(franka_indices), len(franka_indices))

    # reset franka dof
    self.gym.set_dof_state_tensor(self.sim, gymtorch.unwrap_tensor(self.dof_state))
    franka_indices = self.franka_actor_idxs.to(torch.int32)
    self.gym.set_dof_position_target_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.franka_dof_targets),
                                                    gymtorch.unwrap_tensor(franka_indices), len(franka_indices))

    self.progress_buf[env_ids] = 0
    self.reset_buf[env_ids] = 0

Hi @lichothu

I am not sure if this is the problem, but I spotted two things:

  • you are using set_dof_state_tensor, which overwrites the dof state of all environments. Consider using the _indexed version and passing the correct indices for both the robot arm and the cabinets.
  • both the cabinets and the robot arm need their position targets and their dof state reset. The dof state holds the current joint positions and velocities, while the target is the position the joint controller will drive towards. If you reset only one of the two, the cabinet/arm will move straight back towards the previous goal after the reset. It could also cause your crash.
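Note also that the _indexed setters expect int32 actor indices, and it is safer to touch each state tensor with a single indexed call per step, passing the merged indices of every actor you modified. A sketch of the index merging (names follow your code above; `franka_idxs` and `object_idxs` are assumed to be sim-domain actor index tensors):

```python
import torch

def merged_actor_indices(franka_idxs: torch.Tensor,
                         object_idxs: torch.Tensor,
                         env_ids: torch.Tensor) -> torch.Tensor:
    # One indexed call per tensor per step: gather the indices of every
    # actor whose state was written, dedupe, and cast to the int32 dtype
    # that the indexed setters require.
    idxs = torch.cat([franka_idxs[env_ids], object_idxs[env_ids]])
    return torch.unique(idxs).to(torch.int32)
```

The result can then be passed once to set_dof_state_tensor_indexed (and likewise for the root state call) instead of issuing separate calls for the arm and the cabinet.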

I can also see that you have no call to simulate or refresh inside the reset function. The way resets are handled in the examples is faulty to some extent. I would recommend introducing those calls so that the next observation you calculate is accurate and uses the latest values.
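As a sketch of the sequence I mean (the gym/sim handles come from your own setup, and the exact set of refresh calls depends on which tensors your observation uses):

```python
def post_reset_refresh(gym, sim):
    # Step the physics once after writing the reset tensors, then pull
    # the fresh state back so the next observation is not computed from
    # stale pre-reset values.
    gym.simulate(sim)
    gym.fetch_results(sim, True)
    gym.refresh_actor_root_state_tensor(sim)
    gym.refresh_dof_state_tensor(sim)
    gym.refresh_rigid_body_state_tensor(sim)
```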

Please keep me updated!

Hi @mihai.anca13
I am so sorry for the slow reply; I decided to ignore the segfault for a while and move on with my algorithm.
I have now rechecked my reset function.

  • I used set_dof_state_tensor because, in my setup, all environments are reset simultaneously, so I don't think this is the issue.

  • I rewrote the reset function and made sure that both the position targets and the dof state are set, but the segfault still occurs.

  • One important fact I found: if the drawer was not moved during the last interaction (so its dof state does not need to be reset), then the segfault does not occur.

Here is the new reset function.

def reset(self, env_ids):
    print("------------------------------------reset")
    self.task_state = -1

    # reset object
    # reset object dof
    self.object_dof_state[env_ids, :, 1] = torch.zeros_like(self.object_dof_state[env_ids, :, 1])
    self.object_dof_state[env_ids, :, 0] = self.object_init_dof_pos.clone()

    # reset franka
    pos = tensor_clamp(self.franka_default_dof_pos.unsqueeze(0), self.franka_dof_lower_limits,
                       self.franka_dof_upper_limits)

    self.franka_dof_pos[env_ids, :] = pos
    self.franka_dof_vel[env_ids, :] = torch.zeros_like(self.franka_dof_vel[env_ids])
    self.franka_dof_targets[env_ids, :self.num_franka_dofs] = pos
    # self.franka_dof_targets[env_ids, self.num_franka_dofs:] = self.object_init_dof_pos.clone()
    self.root_state_tensor[self.franka_actor_idxs[env_ids]] = self.valid_init_state[env_ids].clone()

    # reset object actor
    object_indices = self.object_actor_idxs[env_ids].to(torch.int32)
    self.gym.set_actor_root_state_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.root_state_tensor),
                                                 gymtorch.unwrap_tensor(object_indices), len(object_indices))
    # reset franka actor
    franka_indices = self.franka_actor_idxs[env_ids].to(torch.int32)
    self.gym.set_actor_root_state_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.root_state_tensor),
                                                 gymtorch.unwrap_tensor(franka_indices), len(franka_indices))

    # reset franka dof
    self.gym.set_dof_state_tensor(self.sim, gymtorch.unwrap_tensor(self.dof_state))
    franka_indices = self.franka_actor_idxs.to(torch.int32)
    self.gym.set_dof_position_target_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.franka_dof_targets),
                                                    gymtorch.unwrap_tensor(franka_indices), len(franka_indices))
    # reset object dof
    self.gym.set_dof_position_target_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.franka_dof_targets),
                                                    gymtorch.unwrap_tensor(object_indices), len(object_indices))

    self.progress_buf[env_ids] = 0
    self.reset_buf[env_ids] = 0