PPO for rl_games vs skrl

Hi, I am using PPO for Isaac-Lift-Franka-v0, which is a task in NVIDIA Isaac Orbit. I found that the performance of PPO in rl_games is much better than in skrl, so I have tried to adjust skrl's parameters to match rl_games's, but it does not work. I am wondering if I have overlooked something, or if the PPO architectures in rl_games and skrl are fundamentally different. Could you give me any advice or insight on how to make PPO in skrl perform as well as it does in rl_games?

Thank you in advance.

Here are the PPO parameters for skrl:

seed: 42

# Models are instantiated using skrl's model instantiator utility
# https://skrl.readthedocs.io/en/develop/modules/skrl.utils.model_instantiators.html
models:
  separate: False
  policy:  # see skrl.utils.model_instantiators.gaussian_model for parameter details
    clip_actions: True
    clip_log_std: False
    min_log_std: -20.0
    max_log_std: 2.0
    input_shape: "Shape.STATES"
    hiddens: [256, 128, 64]
    hidden_activation: ["elu", "elu", "elu"]
    output_shape: "Shape.ACTIONS"
    output_activation: ""
    output_scale: 1.0
  value:  # see skrl.utils.model_instantiators.deterministic_model for parameter details
    clip_actions: False
    input_shape: "Shape.STATES"
    hiddens: [256, 128, 64]
    hidden_activation: ["elu", "elu", "elu"]
    output_shape: "Shape.ONE"
    output_activation: ""
    output_scale: 1.0


# PPO agent configuration (field names are from PPO_DEFAULT_CONFIG)
# https://skrl.readthedocs.io/en/latest/modules/skrl.agents.ppo.html
agent:
  rollouts: 32
  learning_epochs: 5
  mini_batches: 16
  discount_factor: 0.99
  lambda: 0.95
  learning_rate: 5.e-4
  learning_rate_scheduler: "KLAdaptiveRL"
  learning_rate_scheduler_kwargs:
    kl_threshold: 0.008
  state_preprocessor: "RunningStandardScaler"
  state_preprocessor_kwargs: {"size": env.observation_space, "device": device}
  value_preprocessor: "RunningStandardScaler"
  value_preprocessor_kwargs: {"size": 1, "device": device}
  random_timesteps: 0
  learning_starts: 0
  grad_norm_clip: 1.0
  ratio_clip: 0.2
  value_clip: 0.2
  clip_predicted_values: True
  entropy_loss_scale: 0.0
  value_loss_scale: 4.0
  kl_threshold: 0
  # rewards_shaper_scale: 0.01
  # logging and checkpoint
  experiment:
    directory: "lift"
    experiment_name: ""
    write_interval: 120
    checkpoint_interval: 200


# Sequential trainer
# https://skrl.readthedocs.io/en/latest/modules/skrl.trainers.sequential.html
trainer:
  timesteps: 240000
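
For reference, these fields map onto skrl's Python API roughly as follows (a sketch only; `env` is assumed to be an already created and wrapped Isaac Orbit environment, and the module paths are the skrl 0.x / PyTorch ones):

# sketch: building the PPO agent from the YAML above
# (`env` is an already wrapped Isaac Orbit environment -- not defined in the YAML)
from skrl.agents.torch.ppo import PPO, PPO_DEFAULT_CONFIG
from skrl.memories.torch import RandomMemory
from skrl.resources.preprocessors.torch import RunningStandardScaler
from skrl.resources.schedulers.torch import KLAdaptiveRL
from skrl.trainers.torch import SequentialTrainer
from skrl.utils.model_instantiators import Shape, deterministic_model, gaussian_model

device = env.device

# models: -> gaussian policy and deterministic value (instantiated separately here
# for simplicity; with separate: False the actual integration may share the backbone)
models = {}
models["policy"] = gaussian_model(observation_space=env.observation_space,
                                  action_space=env.action_space,
                                  device=device,
                                  clip_actions=True,
                                  clip_log_std=False,
                                  input_shape=Shape.STATES,
                                  hiddens=[256, 128, 64],
                                  hidden_activation=["elu", "elu", "elu"],
                                  output_shape=Shape.ACTIONS,
                                  output_activation="")
models["value"] = deterministic_model(observation_space=env.observation_space,
                                      action_space=env.action_space,
                                      device=device,
                                      input_shape=Shape.STATES,
                                      hiddens=[256, 128, 64],
                                      hidden_activation=["elu", "elu", "elu"],
                                      output_shape=Shape.ONE,
                                      output_activation="")

# one memory slot per rollout step
memory = RandomMemory(memory_size=32, num_envs=env.num_envs, device=device)

# agent: -> overrides of PPO_DEFAULT_CONFIG
cfg = PPO_DEFAULT_CONFIG.copy()
cfg["rollouts"] = 32
cfg["learning_epochs"] = 5
cfg["mini_batches"] = 16
cfg["discount_factor"] = 0.99
cfg["lambda"] = 0.95
cfg["learning_rate"] = 5e-4
cfg["learning_rate_scheduler"] = KLAdaptiveRL
cfg["learning_rate_scheduler_kwargs"] = {"kl_threshold": 0.008}
cfg["state_preprocessor"] = RunningStandardScaler
cfg["state_preprocessor_kwargs"] = {"size": env.observation_space, "device": device}
cfg["value_preprocessor"] = RunningStandardScaler
cfg["value_preprocessor_kwargs"] = {"size": 1, "device": device}
cfg["grad_norm_clip"] = 1.0
cfg["ratio_clip"] = 0.2
cfg["value_clip"] = 0.2
cfg["clip_predicted_values"] = True
cfg["entropy_loss_scale"] = 0.0
cfg["value_loss_scale"] = 4.0
cfg["kl_threshold"] = 0

agent = PPO(models=models,
            memory=memory,
            cfg=cfg,
            observation_space=env.observation_space,
            action_space=env.action_space,
            device=device)

# trainer: -> sequential trainer
trainer = SequentialTrainer(env=env, agents=agent, cfg={"timesteps": 240000})
trainer.train()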

Here are the PPO parameters for rl_games:

params:
  seed: 42

  # environment wrapper clipping
  env:
    clip_observations: 10.0
    clip_actions: 1.0

  algo:
    name: a2c_continuous

  model:
    name: continuous_a2c_logstd

  network:
    name: actor_critic
    separate: False
    space:
      continuous:
        mu_activation: None
        sigma_activation: None

        mu_init:
          name: default
        sigma_init:
          name: const_initializer
          val: 0
        fixed_sigma: True
    mlp:
      units: [256, 128, 64]
      # units: [512, 256, 128]
      activation: elu
      d2rl: False

      initializer:
        name: default
      regularizer:
        name: None

  load_checkpoint: False # flag which sets whether to load the checkpoint
  load_path: '' # path to the checkpoint to load

  config:
    name: lift
    env_name: rlgpu
    device: 'cuda:0'
    device_name: 'cuda:0'
    multi_gpu: False
    ppo: True
    mixed_precision: False
    normalize_input: True
    normalize_value: True
    value_bootstrap: True
    num_actors: -1
    reward_shaper:
      scale_value: 1.0
    normalize_advantage: True
    gamma: 0.99
    tau: 0.95
    learning_rate: 5e-4
    lr_schedule: adaptive
    schedule_type: legacy
    kl_threshold: 0.008
    score_to_win: 10000
    max_epochs: 10000
    save_best_after: 20
    save_frequency: 20
    print_stats: True
    grad_norm: 1.0
    entropy_coef: 0.0
    truncate_grads: True
    e_clip: 0.2
    horizon_length: 32
    minibatch_size: 2048
    mini_epochs: 5
    critic_coef: 4
    clip_value: True
    seq_len: 4
    bounds_loss_coef: 0.0001
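
One sanity check when matching the two configs is the mini-batch bookkeeping: skrl specifies the number of mini-batches directly, while rl_games specifies the mini-batch size. Assuming the 1024 parallel environments mentioned later in this thread, the two configs above are consistent:

# quick consistency check between the two mini-batch conventions
# (num_envs = 1024 is an assumption, taken from later in this thread)
num_envs = 1024
rollouts = 32                                # skrl "rollouts" == rl_games "horizon_length"
batch_size = num_envs * rollouts             # 32768 samples collected per PPO update
minibatch_size = 2048                        # rl_games "minibatch_size"
mini_batches = batch_size // minibatch_size  # = 16, matching skrl "mini_batches: 16"
print(batch_size, mini_batches)              # 32768 16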

Hi @berternats

Please check PPO for rl_games vs skrl · Toni-SM/skrl · Discussion #103 · GitHub and continue the discussion there :)

Hi, thank you so much for the prompt reply.

Regarding the maximum mean reward values I am getting with rl_games and skrl: the task is about a robot trying to grasp an object and lift it up.

With rl_games the robot can complete the task, while with skrl it is only able to reach the object and cannot grasp it.

So the reward difference is large because of the performance gap.

Hi @berternats

When the Isaac-Lift-Franka-v0 environment reward function was fixed, only the rl_games and rsl_rl hyperparameters were updated.

Although for Isaac Orbit I use the rl_games hyperparameters as far as possible, I have updated (in skrl-v1.0.0-rc.2, released recently) the hyperparameters for the Isaac-Lift-Franka-v0 environment, this time based on rsl_rl. Furthermore, I have added time-limit (episode truncation) bootstrapping to skrl's on-policy agents in the latest version, which allows for better mean reward values.
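
For context, time-limit bootstrapping adds the discounted value estimate of the truncated next state back into the reward, so that episodes cut off by the time limit are not treated as real failures during the advantage computation. Roughly (a sketch of the idea, not skrl's or rl_games' exact code):

import torch

def bootstrap_time_limit(rewards: torch.Tensor,
                         values: torch.Tensor,
                         truncated: torch.Tensor,
                         discount_factor: float = 0.99) -> torch.Tensor:
    """Sketch of the idea behind skrl's time_limit_bootstrap and rl_games'
    value_bootstrap: for transitions that ended only because the episode hit
    its time limit, add the discounted value estimate back into the reward."""
    return rewards + discount_factor * values * truncated.float()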

The next plot shows the mean reward for the Isaac-Lift-Franka-v0 environment for the mentioned libraries (with the updated skrl hyperparameters, not yet published in the Isaac Orbit code). Note that, since the number of parallel environments for the lift task was increased from 1024 to 4096, rl_games, with the available hyperparameters, takes much longer to train.

I am working on the skrl integration in the Isaac Orbit repository (to be pushed soon), which will include JAX support and an update of the training hyperparameters.

Meanwhile, you can play with the standalone training script for Isaac Orbit from the skrl docs: torch_lift_franka_ppo.py

Thank you so much for the benchmarking. I really appreciate that.

I will try with the parameters you provided.

By the way, may I know which parameters you were playing with to get the results shown in Figure 2?

Were you using the default controller or the IK controller?

Hi @berternats

Both skrl implementations (Figures 1 and 2) use the same hyperparameters:
Note that the initial_log_std and time_limit_bootstrap are not available in the current public version of Isaac Orbit.

seed: 42

# Models are instantiated using skrl's model instantiator utility
# https://skrl.readthedocs.io/en/latest/api/utils/model_instantiators.html
models:
  separate: False
  policy:  # see skrl.utils.model_instantiators.gaussian_model for parameter details
    clip_actions: False
    clip_log_std: True
    min_log_std: -20.0
    max_log_std: 2.0
    initial_log_std: 1.0
    input_shape: "Shape.STATES"
    hiddens: [256, 128, 64]
    hidden_activation: ["elu", "elu", "elu"]
    output_shape: "Shape.ACTIONS"
    output_activation: ""
    output_scale: 1.0
  value:  # see skrl.utils.model_instantiators.deterministic_model for parameter details
    clip_actions: False
    input_shape: "Shape.STATES"
    hiddens: [256, 128, 64]
    hidden_activation: ["elu", "elu", "elu"]
    output_shape: "Shape.ONE"
    output_activation: ""
    output_scale: 1.0


# PPO agent configuration (field names are from PPO_DEFAULT_CONFIG)
# https://skrl.readthedocs.io/en/latest/api/agents/ppo.html
agent:
  rollouts: 96
  learning_epochs: 5
  mini_batches: 4
  discount_factor: 0.99
  lambda: 0.95
  learning_rate: 1.e-3
  learning_rate_scheduler: "KLAdaptiveRL"
  learning_rate_scheduler_kwargs:
    kl_threshold: 0.01
    min_lr: 1.e-5
  state_preprocessor: "RunningStandardScaler"
  state_preprocessor_kwargs: null
  value_preprocessor: "RunningStandardScaler"
  value_preprocessor_kwargs: null
  random_timesteps: 0
  learning_starts: 0
  grad_norm_clip: 1.0
  ratio_clip: 0.2
  value_clip: 0.2
  clip_predicted_values: True
  entropy_loss_scale: 0.01
  value_loss_scale: 1.0
  kl_threshold: 0
  rewards_shaper_scale: 1.0
  time_limit_bootstrap: True
  # logging and checkpoint
  experiment:
    directory: "lift"
    experiment_name: ""
    write_interval: 800
    checkpoint_interval: 8000


# Sequential trainer
# https://skrl.readthedocs.io/en/latest/api/trainers/sequential.html
trainer:
  timesteps: 67200
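
For reference, the KL-adaptive schedule raises or lowers the learning rate depending on how far the measured policy KL divergence is from the threshold. Roughly (a sketch of the rule; the factors and bounds shown are common defaults and may differ from skrl's exact implementation):

def kl_adaptive_lr(lr: float,
                   kl: float,
                   kl_threshold: float = 0.01,
                   kl_factor: float = 2.0,
                   lr_factor: float = 1.5,
                   min_lr: float = 1e-5,
                   max_lr: float = 1e-2) -> float:
    """Sketch of a KL-adaptive learning-rate rule: shrink the learning rate when
    the policy moved too much in the last update, grow it when it barely moved."""
    if kl > kl_threshold * kl_factor:      # policy changed too much -> slow down
        return max(lr / lr_factor, min_lr)
    if kl < kl_threshold / kl_factor:      # policy barely changed -> speed up
        return min(lr * lr_factor, max_lr)
    return lr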

Regarding the task parameters, Figure 2 uses the default task parameters as defined in the Isaac Orbit lift_cfg.py file.

Thank you so much!

And, sorry for the unrelated topic: when I am using skrl for training, for example on Isaac-Lift-Franka-v0, the training stops halfway, even though it is still far from exceeding the available GPU memory.

The error message says something like there was an error running Python.

May I know if you have any clue?

Thank you in advance.

Hi @berternats

It is difficult to know without a specific error message.

Can you provide the error message or logs?
Have you made any modifications to the task?
Are you using the latest skrl version?

  1. Can you provide the error message or logs?
    I have attached the error message. The training stopped suddenly.
  2. Have you made any modifications to the task?
    I am using my own environment, but it seems only skrl has this problem; rl_games works well with the same environment.
  3. Are you using the latest skrl version?
    I am using the 0.10.0 version.

Thank you in advance!

Hi @berternats

Mmmm, I have never had these types of problems with Isaac Orbit, but perhaps it could be something similar to what is described in the following discussion (which is fixed in the latest skrl versions).

  • Can you try the latest version (skrl-v1.0.0-rc.2)?
  • Are you running the example scripts included in skrl (e.g. torch_lift_franka_ppo.py), or the examples integrated in Isaac Orbit?

Thank you so much. I will give the latest skrl a try.

I am using the examples integrated into Isaac Orbit.

Hi, may I know whether the latest version, skrl-v1.0.0, is ready to use with Isaac Orbit?
Previously Isaac Orbit supported skrl-v0.10.2, so is skrl-v1.0.0 now ready to use with Isaac Orbit? Do we need to modify anything before using it?

Thank you so much.

Hi @berternats

I submitted skrl JAX by Toni-SM · Pull Request #109 · NVIDIA-Omniverse/Orbit · GitHub to the Isaac Orbit repository, which uses the latest version (skrl>=1.0.0).

Waiting for approval :)

Meanwhile, you can play with Isaac Orbit environments via skrl's standalone scripts for Isaac Orbit.
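
Those standalone scripts load and wrap the environment directly from skrl, roughly like this (a sketch; the module paths shown are the skrl >= 1.0 ones, while skrl 0.x exposed them under skrl.envs.torch):

# sketch: loading and wrapping an Isaac Orbit environment from a standalone skrl script
from skrl.envs.loaders.torch import load_isaac_orbit_env
from skrl.envs.wrappers.torch import wrap_env

env = load_isaac_orbit_env(task_name="Isaac-Lift-Franka-v0")
env = wrap_env(env, wrapper="isaac-orbit")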

I tried the latest version in Orbit, where I just changed the environment loaders and wrappers file hierarchy (link attached).

It runs successfully. However, with exactly the same parameters (the ones you provided) and environments (same tasks), the performance of 1.0.0 and 0.10.2 is quite different: with 0.10.2 my robot can successfully grasp the object, whereas with 1.0.0 it cannot at all.

May I know if you have any clue? Is there anything I missed?

Thank you in advance.

Hello, may I know if there is any update on this?