Gym cuda error: running out of memory

I am using an RTX 3060 with 6 GB of memory to run the IsaacGymEnvs example tasks such as Ant and Anymal, and I run into a CUDA out-of-memory error:

[Error] [carb.gym.plugin] Gym cuda error: out of memory: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 1718
[Error] [carb.gym.plugin] Gym cuda error: invalid resource handle: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 6003
[Error] [carb.gym.plugin] Gym cuda error: out of memory: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 991
[Error] [carb.gym.plugin] Gym cuda error: invalid resource handle: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 5859
Error executing job with overrides: ['task=Ant']
Traceback (most recent call last):
  File "train.py", line 112, in launch_rlg_hydra
    'play': cfg.test,
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 139, in run
    self.run_train()
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 122, in run_train
    agent = self.algo_factory.create(self.algo_name, base_name='run', config=self.config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/object_factory.py", line 15, in create
    return builder(**kwargs)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 23, in <lambda>
    self.algo_factory.register_builder('a2c_continuous', lambda **kwargs : a2c_continuous.A2CAgent(**kwargs))
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/algos_torch/a2c_continuous.py", line 18, in __init__
    a2c_common.ContinuousA2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 973, in __init__
    A2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 84, in __init__
    self.vec_env = vecenv.create_vec_env(self.env_name, self.num_actors, **self.env_config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/vecenv.py", line 282, in create_vec_env
    return vecenv_config[vec_env_name](config_name, num_actors, **kwargs)
  File "train.py", line 90, in <lambda>
    lambda config_name, num_actors, **kwargs: RLGPUEnv(config_name, num_actors, **kwargs))
  File "/home/jrenaf/IsaacGymEnvs/isaacgymenvs/utils/rlgames_utils.py", line 159, in __init__
    self.env = env_configurations.configurations[config_name]['env_creator'](**kwargs)
  File "train.py", line 93, in <lambda>
    'env_creator': lambda **kwargs: create_rlgpu_env(**kwargs),
  File "/home/jrenaf/IsaacGymEnvs/isaacgymenvs/utils/rlgames_utils.py", line 91, in create_rlgpu_env
    headless=headless
  File "/home/jrenaf/IsaacGymEnvs/isaacgymenvs/tasks/ant.py", line 97, in __init__
    zero_tensor = torch.tensor([0.0], device=self.device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It seems the GPU does not have enough memory for training. Which parameters should I tune to solve this issue?

Hi @jimingre

You can reduce the number of environments created by using the --num_envs NUM_ENVS argument. This reduces the amount of memory allocated.

Example:

python train.py --task Anymal --num_envs 512

To make the change persistent, edit the numEnvs variable in the configuration files (.yaml) in the PATH_TO_ISAAC_GYM/python/rlgpu/cfg folder.

Hi @toni.sm

Thanks for your advice. I changed num_envs to 100, but training still stops abruptly.

task_name: Ant
experiment: 
num_envs: 100
seed: 42
torch_deterministic: False
max_iterations: 
physics_engine: physx
pipeline: gpu
sim_device: cuda:0
rl_device: cuda:0
graphics_device_id: 0
num_threads: 4
solver_type: 1
num_subscenes: 4
test: False
checkpoint: 
multi_gpu: False
headless: False
Setting seed: 42
Started to train
Exact experiment name requested from command line: Ant
/home/jrenaf/.local/lib/python3.6/site-packages/gym/spaces/box.py:74: UserWarning: WARN: Box bound precision lowered by casting to float32
  "Box bound precision lowered by casting to {}".format(self.dtype)
[Warning] [carb.gym.plugin] useGpu is set, forcing single scene (0 subscenes)
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
num envs 100 env spacing 5
Box([-1. -1. -1. -1. -1. -1. -1. -1.], [1. 1. 1. 1. 1. 1. 1. 1.], (8,), float32) Box([-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf], (60,), float32)
Env info:
{'action_space': Box([-1. -1. -1. -1. -1. -1. -1. -1.], [1. 1. 1. 1. 1. 1. 1. 1.], (8,), float32), 'observation_space': Box([-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf], (60,), float32)}
Error executing job with overrides: ['task=Ant', 'num_envs=100']
Traceback (most recent call last):
  File "train.py", line 112, in launch_rlg_hydra
    'play': cfg.test,
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 139, in run
    self.run_train()
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 122, in run_train
    agent = self.algo_factory.create(self.algo_name, base_name='run', config=self.config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/object_factory.py", line 15, in create
    return builder(**kwargs)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 23, in <lambda>
    self.algo_factory.register_builder('a2c_continuous', lambda **kwargs : a2c_continuous.A2CAgent(**kwargs))
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/algos_torch/a2c_continuous.py", line 18, in __init__
    a2c_common.ContinuousA2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 973, in __init__
    A2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 171, in __init__
    assert(self.batch_size % self.minibatch_size == 0)
AssertionError

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

However, if I set pipeline to CPU, it works fine.

Hi @jimingre

You are getting this exception (in rl_games) because the batch_size is not a multiple of minibatch_size.

I checked the values for 100, 512 and 1024 (default) environments for the Ant task:

  • 1024 environments (default) [yes]
    batch_size = 16384
    minibatch_size = 8192

  • 512 environments (minimum number of environments that satisfies the condition) [yes]
    batch_size = 8192
    minibatch_size = 8192

  • 100 environments [no]
    batch_size = 1600
    minibatch_size = 8192
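
The three cases above can be reproduced with a small sketch. It assumes batch_size is num_envs multiplied by a rollout length of 16 (inferred from 16384 / 1024 in the figures above, not read from the rl_games source), and the function names are hypothetical helpers, not part of rl_games:

```python
def rollout_batch_size(num_envs, horizon_length=16):
    """Samples per update, assuming one sample per environment per rollout step.

    horizon_length=16 is an assumption inferred from the figures in this thread.
    """
    return num_envs * horizon_length


def minibatch_fits(num_envs, minibatch_size=8192, horizon_length=16):
    """Mirror the failing check: batch_size % minibatch_size == 0."""
    return rollout_batch_size(num_envs, horizon_length) % minibatch_size == 0


for n in (1024, 512, 100):
    # Prints the environment count, the resulting batch size,
    # and whether the divisibility condition holds.
    print(n, rollout_batch_size(n), minibatch_fits(n))
```

With 100 environments the batch holds only 1600 samples, which is smaller than the minibatch size of 8192, so the remainder is non-zero and the assert fires.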

Solutions:

  1. try with 512 environments

  2. no idea (I have not explored rl_games in depth), so you could open a new issue in the rl_games repository 😅

Note: to reproduce your error I had to install rl_games from pip (pip install rl-games==1.0.2), because installing the latest version directly from GitHub gives some configuration-related errors before reaching the assert line 🤷‍♂️

Hi @toni.sm

How can I get access to the old version of IsaacGymEnvs?

Fix: For Ant, I changed the minibatch size to 8192 in the configuration and now it works fine. Anymal, however, still runs out of memory.
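
For anyone picking a minibatch size for a different number of environments, here is a rough sketch that lists the values that would pass the rl_games assert. It assumes the batch size is num_envs times a rollout length of 16 (inferred from the figures quoted earlier in this thread), and candidate_minibatch_sizes is a hypothetical helper, not an rl_games function:

```python
def candidate_minibatch_sizes(num_envs, horizon_length=16):
    """Return every divisor of the batch size, largest first.

    Any of these values satisfies batch_size % minibatch_size == 0.
    horizon_length=16 is an assumption, not a value read from the config.
    """
    batch_size = num_envs * horizon_length
    return [d for d in range(batch_size, 0, -1) if batch_size % d == 0]


# Five largest candidates for 100 environments (batch size 1600).
print(candidate_minibatch_sizes(100)[:5])
```

This only addresses the divisibility check. Whether a given size trains well or fits in GPU memory is a separate question.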

Hi @jimingre

  1. Could you please try to install rl_games from pip using the following command?

    pip install rl-games==1.0.2
    # or
    python -m pip install rl-games==1.0.2
    

    You can check your current rl_games version using

    python -m pip show rl_games
    

    This version worked for me with 512 environments.

  2. Regarding the old version of IsaacGymEnvs, I don’t know how or if it is possible (I’m just a common user). The Isaac-Gym team could provide you with more information on this topic.

Hi @toni.sm

Thank you for that information.

My rl-games version is 1.1.3. Do you suggest that I downgrade to 1.0.2?

Hi @jimingre

Yes. What is there to lose? You can always reinstall the latest version.
Well, in the end I don’t know anything about rl_games… 😅

Hi @toni.sm

There are some dependency issues. Maybe I will start a new thread about rl-games. Thanks for your help anyway!

Hello, nice to meet you.
Did you solve it? I have the same problem, and I am using an RTX 3060 6GB too.

Hi,

You might want to refer to this post, which solved my problem.

Thanks a lot!!!