compute_observations() in a custom task

Hi all,

is there any rule of thumb for the obs_buf? Is it always better to put all the information there?
For example, my task is about reaching a sphere with the end-effector tip. Here, I defined the obs_buf as below:

    to_target = self.sphere_poses - self.my_robot_tip_pos

    self.obs_buf[..., 0:5] = dof_pos_scaled
    self.obs_buf[..., 5:10] = self.my_robot_dof_vel * self.dof_vel_scale
    self.obs_buf[..., 10:13] = to_target
    self.obs_buf[..., 13:16] = self.sphere_poses

But is there any other rule of thumb for designing the obs_buf?

Hi @hosei2,

Your choice of observations looks good to me. Adding robot end-effector orientation could be helpful as well. I also found that adding past actions to the observation helped learning in a lot of cases.
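As a rough sketch of what that could look like, the extra slots are simply appended to the buffer. NumPy is used here as a stand-in for the torch tensors in an actual Isaac Gym task, and all sizes and names (5 DOFs, `ee_quat`, `prev_actions`) are hypothetical:

```python
import numpy as np

# Hypothetical sizes for illustration: 2 envs, 5 DOFs, 3-D positions, 4-D quaternion.
num_envs, num_dofs = 2, 5
dof_pos_scaled = np.zeros((num_envs, num_dofs))
dof_vel_scaled = np.zeros((num_envs, num_dofs))
to_target = np.zeros((num_envs, 3))
sphere_poses = np.zeros((num_envs, 3))
ee_quat = np.tile(np.array([0.0, 0.0, 0.0, 1.0]), (num_envs, 1))  # end-effector orientation
prev_actions = np.zeros((num_envs, num_dofs))                     # last policy output in [-1, 1]

obs_buf = np.concatenate(
    [dof_pos_scaled, dof_vel_scaled, to_target, sphere_poses, ee_quat, prev_actions],
    axis=-1,
)  # shape: (num_envs, 5 + 5 + 3 + 3 + 4 + 5) = (num_envs, 25)
```

The observation dimension passed to the network config then has to match the new total width.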

The choice of observations often depends on your goals. If you have a real robot and you'd like to perform sim2real, your choice of observations is limited to what is available on the real robot and from its surroundings. But even in this case you can use an asymmetric PPO version for training and pass a full set of observations to the value function; see the Shadow Hand environment as an example.
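The asymmetric idea can be sketched like this: the policy receives only what a real robot could measure, while the value function additionally receives privileged simulator state. All array sizes and variable names here are made up for illustration (NumPy standing in for torch tensors):

```python
import numpy as np

num_envs = 4
# Policy observations: only quantities measurable on hardware (joint pos/vel, to_target, ...).
policy_obs = np.zeros((num_envs, 16))

# Privileged quantities the simulator knows but a real robot cannot easily observe.
sphere_vel = np.zeros((num_envs, 3))
contact_forces = np.zeros((num_envs, 6))

# The critic gets the policy observations plus the privileged state.
critic_state = np.concatenate([policy_obs, sphere_vel, contact_forces], axis=-1)
```

At deployment only `policy_obs` is needed, so the trained policy still runs on the real robot.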

If you don't have a sim2real goal, sharing all the information provided by the simulator, sometimes even with hand-crafted features, is a good first step. It also depends on the complexity of the task: simpler tasks can usually be solved with a very limited set of observations, and having fewer observations allows using smaller networks and faster training.

Also, the reward is very important; I'd say it's often more important than the choice of observations.

I also found that adding past actions to the observation helped learning in a lot of cases.

Could you explain the implementation of this? I have seen this done in research papers, but I found that it did not work at all in my own implementation.

In the compute_observations function I pass in self.actions, which comes from pre_physics_step via self.actions = actions.clone().to(self.device), but the results were never good. Is that the right approach, or should it be the scaled actions that you send to your robot's actuators?
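For reference, the approach described above could be sketched as follows. This is a minimal stand-in (NumPy instead of torch, made-up sizes and torque scale), not the actual task code:

```python
import numpy as np

class ReachTask:
    """Minimal sketch of the past-actions-as-observations pattern."""

    def __init__(self, num_envs=2, num_dofs=5):
        self.num_dofs = num_dofs
        self.actions = np.zeros((num_envs, num_dofs))
        # First 16 slots: dof pos/vel, to_target, sphere pose; last slots: past actions.
        self.obs_buf = np.zeros((num_envs, 16 + num_dofs))

    def pre_physics_step(self, actions):
        # Keep the raw policy output (range [-1, 1]); scale a separate copy for actuation.
        self.actions = actions.copy()
        forces = self.actions * 100.0  # hypothetical torque scale
        return forces

    def compute_observations(self):
        # ... fill slots 0:16 with the regular observations here ...
        self.obs_buf[..., 16:16 + self.num_dofs] = self.actions  # unscaled past actions
        return self.obs_buf
```

The key point is that `self.actions` keeps the raw policy output, while the scaled forces are computed separately and never enter the observation buffer.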

There could be different approaches, but the simplest is just copying the previous actions, similar to what you've described. If your agent doesn't train well, the reason is most likely not how the past actions are copied but the reward and the other observations themselves. The same Humanoid env can train quite well even without adding past actions to the observations.

Alright, thank you, I will give it another shot some day.

The strange thing was that my initial test with the actions copied into the observations was really bad, but with the same reward and observations, just without the actions, it was able to train quite well on the task I created.

Can you confirm that you copied the actions produced by the policy, usually in the range [-1, 1], and not the actuation forces applied to the joints? That is the first reason I can think of for why training became worse. With past actions in the range [-1, 1] passed as observations, in the worst case the performance should stay the same.
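A quick way to see why the distinction matters: the scaled forces can be orders of magnitude larger than the other normalized observations. The torque limit below is a made-up value for illustration:

```python
import numpy as np

max_torque = 87.0  # hypothetical actuator limit
actions = np.clip(np.array([[0.3, -0.8, 1.0]]), -1.0, 1.0)  # raw policy output
forces = actions * max_torque                               # what is applied to the joints

# The observation buffer should receive `actions`, not `forces`: the unscaled
# values stay on the same scale as the other normalized observations.
past_action_obs = actions
```

Feeding `forces` instead would let a few large values dominate the network inputs, which can easily make training worse rather than better.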

I have not found the time to test it yet, but I will check the values and make another attempt when I am back in the office next week. I do remember using the self.actions variable, which should be in the range [-1, 1], before it is scaled to a force, etc.