Clarifications about resetting environments

From the Stable Baselines page on Vectorized Environments (Vectorized Environments — Stable Baselines 2.10.3a0 documentation):

When using vectorized environments, the environments are automatically reset at the end of each episode. Thus, the observation returned for the i-th environment when done[i] is true will in fact be the first observation of the next episode, not the last observation of the episode that has just terminated. You can access the “real” final observation of the terminated episode—that is, the one that accompanied the done event provided by the underlying environment—using the terminal_observation keys in the info dicts returned by the vecenv.
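To make the quoted behaviour concrete, here is a minimal pure-Python sketch of an auto-resetting vectorized wrapper. `CountingEnv` and `AutoResetVecEnv` are hypothetical toy classes written for this illustration, not the Stable Baselines implementation; only the `terminal_observation` info key is taken from the documentation quoted above.

```python
class CountingEnv:
    """Toy environment (hypothetical): the observation is the step count."""

    def __init__(self, horizon):
        self.horizon = horizon  # episode length in steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


class AutoResetVecEnv:
    """Toy vectorized wrapper that resets each sub-environment as soon as it is done."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d, info = env.step(action)
            if d:
                # Keep the true final observation in the info dict ...
                info = dict(info, terminal_observation=o)
                # ... and return the first observation of the next episode instead.
                o = env.reset()
            obs.append(o)
            rewards.append(r)
            dones.append(d)
            infos.append(info)
        return obs, rewards, dones, infos


vec = AutoResetVecEnv([CountingEnv(horizon=2), CountingEnv(horizon=3)])
vec.reset()
vec.step([0, 0])
obs, rewards, dones, infos = vec.step([0, 0])
print(obs, dones, infos[0])
```

After the second step, environment 0 reports `done` together with the first observation of its new episode, while the final observation of the finished episode is only available under `infos[0]["terminal_observation"]`.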

Am I correct in assuming that resets in Isaac Gym do NOT work like this?
In other words, an environment returns its terminal observation and resets itself on the following step, discarding the action computed from that terminal observation. (This seems to be the case because, in each task's post_physics_step(...), self.reset(...) is called before self.reset_buf is updated.)

Also, is there any point in setting self.reset_buf[env_ids] = 0 at the end of self.reset(...)? It seems this value just gets overwritten when compute_reward(...) is called later in post_physics_step(...). Finally, at the end of self.reset(...) in anymal.py, the line is self.reset_buf[env_ids] = 1 instead of self.reset_buf[env_ids] = 0. Why is this? Does it just not matter?


Yes, you are correct: if an environment needs a reset (indicated by reset_buf = 1), it is reset on the next step, and the action computed from the terminal observation is discarded. We set self.reset_buf[env_ids] = 0 at the end of reset to mark the environment as already reset, so that it is not reset again when reset_buf is checked on the following step. reset_buf may be overwritten in compute_reward, but that is purely implementation dependent; relying on it leads to incorrect behaviour if compute_reward does not update reset_buf for environments that don't need to be reset. I believe self.reset_buf[env_ids] = 1 in anymal.py is likely a bug, but the behaviour is not affected because the full reset buffer is recomputed in compute_anymal_reward.
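The ordering described above can be sketched with a small pure-Python stand-in. `ToyTask`, its `horizon` and `clear_in_reset` parameters, and the `run` helper are hypothetical names invented for this illustration; this is not the Isaac Gym task code. The sketch assumes a `compute_reward` that only raises reset flags and never clears them, which is the case where clearing `reset_buf` inside `reset` matters.

```python
class ToyTask:
    """Toy stand-in for a task: state is the running sum of actions."""

    def __init__(self, num_envs, horizon, clear_in_reset=True):
        self.num_envs = num_envs
        self.horizon = horizon                # steps before an env is flagged
        self.clear_in_reset = clear_in_reset  # hypothetical switch for the demo
        self.state_buf = [0] * num_envs
        self.progress_buf = [0] * num_envs
        self.obs_buf = [0] * num_envs
        self.reset_buf = [0] * num_envs

    def reset(self, env_ids):
        for i in env_ids:
            self.state_buf[i] = 0
            self.progress_buf[i] = 0
            if self.clear_in_reset:
                self.reset_buf[i] = 0  # mark this env as already reset

    def compute_reward(self):
        # Assumption: flags are only raised here, never cleared.
        for i in range(self.num_envs):
            if self.progress_buf[i] >= self.horizon:
                self.reset_buf[i] = 1

    def post_physics_step(self):
        self.progress_buf = [p + 1 for p in self.progress_buf]
        # Reset envs flagged on the previous step before reset_buf is updated.
        env_ids = [i for i, flag in enumerate(self.reset_buf) if flag]
        if env_ids:
            self.reset(env_ids)
        self.obs_buf = list(self.state_buf)
        self.compute_reward()

    def step(self, actions):
        # "Physics": each action is added to its env's state.
        self.state_buf = [s + a for s, a in zip(self.state_buf, actions)]
        self.post_physics_step()
        return list(self.obs_buf), list(self.reset_buf)


def run(task, actions):
    """Step a single-env task through a list of scalar actions."""
    return [task.step([a]) for a in actions]


print(run(ToyTask(1, horizon=2, clear_in_reset=True), [1, 1, 5, 1]))
print(run(ToyTask(1, horizon=2, clear_in_reset=False), [1, 1, 5, 1]))
```

With `clear_in_reset=True`, the second step returns the terminal observation together with `reset_buf = 1`, and the third step's action (5) is discarded because the environment is reset before observations are computed. With `clear_in_reset=False`, the flag is never cleared in this sketch, so the environment is reset on every subsequent step.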


Thank you!
