RL performance drops as the number of environments increases

I’ve been working with this custom environment I created.
When I test this environment with Stable-Baselines3, or even in Isaac Gym with a single environment instance, the optimal policy is found and the agent quickly converges to the end goal.

However, when I run multiple environments distributed over multiple GPUs, performance drops significantly, both in actor/critic convergence and in the reward collected.
I'm not even talking about 1000+ environments: even with 3-6 parallel environments across 3 GPU nodes, the policy does not converge, and the accumulated reward is roughly half of what the single-environment agent collects.

I've implemented a SAC agent with multi-GPU support (PyTorch distributed). Currently the model parameters are synchronized after the loss backward pass is computed and before optimizer.step() is called; the sync uses reduce-sum ops.
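Roughly, the sync step I mean looks like this (a simplified single-process sketch with a toy model, not the actual training code; `sync_gradients` is an illustrative helper name):

```python
import os
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    # All-reduce (sum) each gradient across ranks, then divide by the
    # world size so every rank steps with the *average* gradient.
    # A plain sum without this division scales the effective learning
    # rate by the number of ranks.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

# Single-process demo with the gloo backend; in a real run, torchrun
# sets RANK/WORLD_SIZE and each rank drives its own GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).sum()
loss.backward()
sync_gradients(model)  # after backward(), before optimizer.step()
torch.optim.Adam(model.parameters()).step()
dist.destroy_process_group()
```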

Or does a reward function defined for a single environment need to be modified to suit multiple environments? I've experimented with the reward function too; it does change the learning curves, but again there is no convergence with multiple parallel environments.

Any similar experiences or insights would be really appreciated! Thanks

Hello @dong-jin.kim ,

It's hard to say what could be wrong in your case or how to debug it, as it could be related to SB3 implementation details. I have one general comment: there is no need for multi-GPU training if you are running fewer than ~1K envs per GPU. If you are running only 3-6 envs across 3 GPUs, it might make sense to debug first on a single GPU with 9-18 envs or more. You might also find it useful to look into the SAC training examples in isaacgymenvs.