Train with rl-games: CUDA out of memory

Hello,

When running the example task Anymal, I ran into the CUDA out-of-memory problem shown below. I have tried reducing the minibatch size to 8192 or even smaller and lowering num_envs to 512, but the out-of-memory problem persists. Other tasks like Ant have had a similar problem on my laptop (RTX 3060, 6 GB), but for those the issue went away once I decreased the minibatch size and num_envs. I suspect the problem might come from rl-games, but I am not sure.
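For reference, I launch training roughly like this (num_envs is a documented command-line override; the train.params.config.minibatch_size path is my best guess at where the minibatch size lives in the rl-games train config, so it may need adjusting):

python train.py task=Anymal num_envs=512 train.params.config.minibatch_size=8192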

[Error] [carb.gym.plugin] Gym cuda error: out of memory: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 1718
[Error] [carb.gym.plugin] Gym cuda error: invalid resource handle: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 6003
[Error] [carb.gym.plugin] Gym cuda error: out of memory: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 991
[Error] [carb.gym.plugin] Gym cuda error: invalid resource handle: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 5859
[Error] [carb.gym.plugin] Gym cuda error: invalid resource handle: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 6099
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 1019
Error executing job with overrides: ['task=Anymal', 'num_envs=512']
Traceback (most recent call last):
  File "train.py", line 112, in launch_rlg_hydra
    'play': cfg.test,
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 139, in run
    self.run_train()
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 122, in run_train
    agent = self.algo_factory.create(self.algo_name, base_name='run', config=self.config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/object_factory.py", line 15, in create
    return builder(**kwargs)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 23, in <lambda>
    self.algo_factory.register_builder('a2c_continuous', lambda **kwargs : a2c_continuous.A2CAgent(**kwargs))
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/algos_torch/a2c_continuous.py", line 18, in __init__
    a2c_common.ContinuousA2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 973, in __init__
    A2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 84, in __init__
    self.vec_env = vecenv.create_vec_env(self.env_name, self.num_actors, **self.env_config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/vecenv.py", line 282, in create_vec_env
    return vecenv_config[vec_env_name](config_name, num_actors, **kwargs)
  File "train.py", line 90, in <lambda>
    lambda config_name, num_actors, **kwargs: RLGPUEnv(config_name, num_actors, **kwargs))
  File "/home/jrenaf/IsaacGymEnvs/isaacgymenvs/utils/rlgames_utils.py", line 159, in __init__
    self.env = env_configurations.configurations[config_name]['env_creator'](**kwargs)
  File "train.py", line 93, in <lambda>
    'env_creator': lambda **kwargs: create_rlgpu_env(**kwargs),
  File "/home/jrenaf/IsaacGymEnvs/isaacgymenvs/utils/rlgames_utils.py", line 91, in create_rlgpu_env
    headless=headless
  File "/home/jrenaf/IsaacGymEnvs/isaacgymenvs/tasks/anymal.py", line 128, in __init__
    self.commands = torch.zeros(self.num_envs, 3, dtype=torch.float, device=self.device, requires_grad=False)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Do you want to reduce the number of neurons, or buy a 12 GB GPU?
In your case, why not reduce ‘num_envs’ further? The minimum value is ‘2’.

Hi @DDPG7

I shrank num_envs to 2 and set the minibatch size to 8 with pipeline=cpu, and now it occasionally works.
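For reference, roughly the command I ended up running (again, the minibatch_size override path is my guess at the rl-games config location and may need adjusting):

python train.py task=Anymal pipeline=cpu num_envs=2 train.params.config.minibatch_size=8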

Why is the minimum num_envs 2? I set it to 1 and no errors popped up.

You could set the training parameter headless=True when you train the Anymal example.

The Anymal assets use full meshes, so they can take up more memory. Try reducing num_envs further, and also try running headless with headless=True as others have mentioned.
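For example, something along these lines (assuming the standard IsaacGymEnvs train.py entry point; the exact num_envs value is just an illustration):

python train.py task=Anymal headless=True num_envs=256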

Hello, how can I reduce num_envs to 2?