Segmentation fault (core dumped) when running gym.simulate(sim)

I am trying to create an environment with an articulated object, and I plan to train a policy to interact with such objects. For now, the environment contains a half-open drawer, and the robot tries to open/close it. I created 2048 environments in total in the simulator, but when I forward actions to the robot, I get a segmentation fault after several steps. I looked into the code and found that the segfault happens inside the gym.simulate(sim) call. How should I debug this error? (I am using PhysX as my physics engine and simulating everything on the GPU, and I already have a working Vulkan driver. Interestingly, if I reduce the number of environments, e.g. train with only a single environment, the segfault goes away. Likewise, if I place the robot far away from the object so that it never touches it, the segfault is also gone, so this seems related to contact simulation.)

Hi @lichothu.

I had similar issues when working on my environment. I believe it comes from either calling gym.simulate while there are nan values inside the simulator, or from resetting objects with the wrong syntax.

In my case, the nan values appeared when objects were colliding on spawn and then being shot into space.
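If you want to rule that out, a quick sanity check on the state tensors right before each simulate call can catch it early. A minimal sketch (the tensor you pass is whichever gymtorch-wrapped state you want to inspect; the threshold is an assumption you should tune):

```python
import torch

def has_exploded(state: torch.Tensor, limit: float = 1e6) -> bool:
    # nan/inf, or absurdly large values, usually mean an object spawned
    # in collision and was shot into space during the first steps
    return bool(torch.isnan(state).any()
                or torch.isinf(state).any()
                or state.abs().max() > limit)
```

Calling this on the dof state and root state tensors before gym.simulate lets you fail with a clean assertion instead of a segfault.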

By syntax I mean the variables you use to reset the dof positions, but also the index variable, which must be properly instantiated. It would be a bit easier to help if I could see some of your code.

I would suggest setting up a dummy script that just resets your environment continuously. You can then track which line before the simulate() call causes the trouble.
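Something like this (a sketch; `reset_fn` and `simulate_fn` are placeholders for your own `task.reset(...)` and `gym.simulate(sim)` calls):

```python
def stress_test_reset(reset_fn, simulate_fn, n_iters=1000):
    # Reset continuously; because the prints are flushed, the last line
    # on stdout when the process segfaults tells you which call died.
    for step in range(n_iters):
        print(f"[{step}] reset", flush=True)
        reset_fn()
        print(f"[{step}] simulate", flush=True)
        simulate_fn()
```

For example, `stress_test_reset(lambda: task.reset(all_env_ids), lambda: gym.simulate(sim))`; commenting out lines inside your reset then bisects the culprit.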

Hope this helps,
Mihai

Hi @mihai.anca13

Thanks for the reply. I double-checked whether I have nan values during the reset, and every value looks fine. I noticed that the segfault happens after several resets of the environment, and that it does not occur if the gripper is far away from the articulated object. So maybe I did something wrong in the reset method. Here is the code for reset:

def reset(self, env_ids):
    print("------------------------------------reset")
    self.task_state = -1

    # reset object
    # reset object dof
    self.object_dof_state[env_ids, :, 1] = torch.zeros_like(self.object_dof_state[env_ids, :, 1])
    self.object_dof_state[env_ids, :, 0] = ((to_torch(self.object_dof_upper_limits, device=self.device)
                                             + to_torch(self.object_dof_lower_limits, device=self.device))
                                            * 0.5).repeat((self.num_envs, 1))

    # reset franka
    pos = tensor_clamp(self.franka_default_dof_pos.unsqueeze(0), self.franka_dof_lower_limits, self.franka_dof_upper_limits)

    self.franka_dof_pos[env_ids, :] = pos
    self.franka_dof_vel[env_ids, :] = torch.zeros_like(self.franka_dof_vel[env_ids])
    self.franka_dof_targets[env_ids, :self.num_franka_dofs] = pos
    self.root_state_tensor[self.franka_actor_idxs[env_ids]] = self.valid_init_state[env_ids].clone()

    # reset franka actor
    franka_indices = self.franka_actor_idxs[env_ids].to(torch.int32)
    self.gym.set_actor_root_state_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.root_state_tensor),
                                                 gymtorch.unwrap_tensor(franka_indices), len(franka_indices))

    # reset franka dof
    self.gym.set_dof_state_tensor(self.sim, gymtorch.unwrap_tensor(self.dof_state))
    franka_indices = self.franka_actor_idxs.to(torch.int32)
    self.gym.set_dof_position_target_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.franka_dof_targets),
                                                    gymtorch.unwrap_tensor(franka_indices), len(franka_indices))

    self.progress_buf[env_ids] = 0
    self.reset_buf[env_ids] = 0

Hi @lichothu

I am not sure if this is the problem, but I spotted two things:

  • you are using set_dof_state_tensor, which overwrites the dof state of all environments. Consider using the _indexed version and passing the correct indices for both the robot arm and the cabinets.
  • both the cabinets and the robot arm need their position targets and their dof state reset. The dof state holds the current joint positions and velocities, while the target is the position the joint controller will drive towards. If you reset only one of the two, the cabinet/arm will move straight back towards the previous goal after the reset. It could also cause your crash.
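Note also that the _indexed setters expect int32 actor indices, and it is safer to touch each state tensor with a single indexed call per step, passing the merged indices of every actor you modified. A sketch of the index merging (names follow your code above; `franka_idxs` and `object_idxs` are assumed to be sim-domain actor index tensors):

```python
import torch

def merged_actor_indices(franka_idxs: torch.Tensor,
                         object_idxs: torch.Tensor,
                         env_ids: torch.Tensor) -> torch.Tensor:
    # One indexed call per tensor per step: gather the indices of every
    # actor whose state was written, dedupe, and cast to the int32 dtype
    # that the indexed setters require.
    idxs = torch.cat([franka_idxs[env_ids], object_idxs[env_ids]])
    return torch.unique(idxs).to(torch.int32)
```

The result can then be passed once to set_dof_state_tensor_indexed (and likewise for the root state call) instead of issuing separate calls for the arm and the cabinet.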

I can also see that you have no call to simulate or refresh inside the reset function. The way resets are handled in the examples is faulty to some extent. I would recommend introducing those calls so that the next observation you calculate is accurate and uses the latest values.
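As a sketch of the sequence I mean (the gym/sim handles come from your own setup, and the exact set of refresh calls depends on which tensors your observation uses):

```python
def post_reset_refresh(gym, sim):
    # Step the physics once after writing the reset tensors, then pull
    # the fresh state back so the next observation is not computed from
    # stale pre-reset values.
    gym.simulate(sim)
    gym.fetch_results(sim, True)
    gym.refresh_actor_root_state_tensor(sim)
    gym.refresh_dof_state_tensor(sim)
    gym.refresh_rigid_body_state_tensor(sim)
```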

Please keep me updated!

Hi @mihai.anca13
I am so sorry for the slow reply; I decided to ignore the segfault for a while and move on with my algorithm.
I have now rechecked my reset function.

  • I used set_dof_state_tensor because, in my setup, all environments are reset simultaneously, so I don't think this is the issue.

  • I rewrote the reset function and made sure that both the position targets and the dof state are set, but the segfault still occurs.

  • One important fact I found: if the drawer was not moved during the last interaction (so its dof state does not need to be reset), then the segfault does not occur.

Here is the new reset function.

def reset(self, env_ids):
    print("------------------------------------reset")
    self.task_state = -1

    # reset object
    # reset object dof
    self.object_dof_state[env_ids, :, 1] = torch.zeros_like(self.object_dof_state[env_ids, :, 1])
    self.object_dof_state[env_ids, :, 0] = self.object_init_dof_pos.clone()

    # reset franka
    pos = tensor_clamp(self.franka_default_dof_pos.unsqueeze(0), self.franka_dof_lower_limits,
                       self.franka_dof_upper_limits)

    self.franka_dof_pos[env_ids, :] = pos
    self.franka_dof_vel[env_ids, :] = torch.zeros_like(self.franka_dof_vel[env_ids])
    self.franka_dof_targets[env_ids, :self.num_franka_dofs] = pos
    # self.franka_dof_targets[env_ids, self.num_franka_dofs:] = self.object_init_dof_pos.clone()
    self.root_state_tensor[self.franka_actor_idxs[env_ids]] = self.valid_init_state[env_ids].clone()

    # reset object actor
    object_indices = self.object_actor_idxs[env_ids].to(torch.int32)
    self.gym.set_actor_root_state_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.root_state_tensor),
                                                 gymtorch.unwrap_tensor(object_indices), len(object_indices))
    # reset franka actor
    franka_indices = self.franka_actor_idxs[env_ids].to(torch.int32)
    self.gym.set_actor_root_state_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.root_state_tensor),
                                                 gymtorch.unwrap_tensor(franka_indices), len(franka_indices))

    # reset franka dof
    self.gym.set_dof_state_tensor(self.sim, gymtorch.unwrap_tensor(self.dof_state))
    franka_indices = self.franka_actor_idxs.to(torch.int32)
    self.gym.set_dof_position_target_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.franka_dof_targets),
                                                    gymtorch.unwrap_tensor(franka_indices), len(franka_indices))
    # reset object dof
    self.gym.set_dof_position_target_tensor_indexed(self.sim, gymtorch.unwrap_tensor(self.franka_dof_targets),
                                                    gymtorch.unwrap_tensor(object_indices), len(object_indices))

    self.progress_buf[env_ids] = 0
    self.reset_buf[env_ids] = 0