I’ve been working with this custom environment I created.
When I test this environment with Stable Baselines3 or even on IsaacGym with a single instance of the environment, the optimal policy is found and it quickly converges to the end goal.
However, when I run multiple environments distributed over multiple GPUs, performance drops significantly, both in actor/critic convergence and in the reward collected.
I am not even talking about 1000+ environments: even with just 3-6 parallel environments across 3 GPU nodes, the policy does not converge, and the reward it accumulates is roughly half of what the single-environment agent collects.
I’ve built a SAC agent with multi-GPU support (PyTorch distributed). Currently the model parameters are synchronized after the loss backward pass and before optimizer.step() is called, using reduce-sum ops. Could this synchronization scheme be the problem?
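For reference, here is a minimal sketch of the synchronization pattern I mean. Note that the post describes reduce-summing; the common pattern is to all-reduce the *gradients* and then divide by the world size, since summing without averaging scales the effective learning rate by the number of workers. The `sync_gradients` helper name is my own, and the demo runs single-process on the gloo backend just so it is self-contained:

```python
import os
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """All-reduce gradients across workers and average them.
    A plain SUM without dividing by world_size multiplies the
    effective learning rate by the number of workers."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size  # average, don't just sum

# Single-process demo (world_size=1) so the sketch is runnable as-is;
# in the real setup each rank would be launched via torchrun/mp.spawn.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).sum()
loss.backward()            # local gradients
sync_gradients(model)      # sync after backward, before step
torch.optim.SGD(model.parameters(), lr=1e-2).step()

dist.destroy_process_group()
```

With world_size=1 the all-reduce is a no-op, but the same code is what runs per rank in the distributed case.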
Or does the reward function defined for the single environment need to be modified for the multi-environment setup? I’ve experimented with the reward function too; it does change the learning curves, but again there is no convergence with multiple parallel environments.
Any similar experiences or insights would be really appreciated! Thanks.