Gym cuda error: running out of memory

jimingre · October 31, 2021, 3:21am

I am using an RTX 3060 with 6 GB of memory to run the IsaacGymEnvs example tasks such as Ant or Anymal, and I run into a CUDA out-of-memory error:

[Error] [carb.gym.plugin] Gym cuda error: out of memory: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 1718
[Error] [carb.gym.plugin] Gym cuda error: invalid resource handle: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 6003
[Error] [carb.gym.plugin] Gym cuda error: out of memory: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 991
[Error] [carb.gym.plugin] Gym cuda error: invalid resource handle: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 5859
Error executing job with overrides: ['task=Ant']
Traceback (most recent call last):
  File "train.py", line 112, in launch_rlg_hydra
    'play': cfg.test,
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 139, in run
    self.run_train()
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 122, in run_train
    agent = self.algo_factory.create(self.algo_name, base_name='run', config=self.config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/object_factory.py", line 15, in create
    return builder(**kwargs)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 23, in <lambda>
    self.algo_factory.register_builder('a2c_continuous', lambda **kwargs : a2c_continuous.A2CAgent(**kwargs))
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/algos_torch/a2c_continuous.py", line 18, in __init__
    a2c_common.ContinuousA2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 973, in __init__
    A2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 84, in __init__
    self.vec_env = vecenv.create_vec_env(self.env_name, self.num_actors, **self.env_config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/vecenv.py", line 282, in create_vec_env
    return vecenv_config[vec_env_name](config_name, num_actors, **kwargs)
  File "train.py", line 90, in <lambda>
    lambda config_name, num_actors, **kwargs: RLGPUEnv(config_name, num_actors, **kwargs))
  File "/home/jrenaf/IsaacGymEnvs/isaacgymenvs/utils/rlgames_utils.py", line 159, in __init__
    self.env = env_configurations.configurations[config_name]['env_creator'](**kwargs)
  File "train.py", line 93, in <lambda>
    'env_creator': lambda **kwargs: create_rlgpu_env(**kwargs),
  File "/home/jrenaf/IsaacGymEnvs/isaacgymenvs/utils/rlgames_utils.py", line 91, in create_rlgpu_env
    headless=headless
  File "/home/jrenaf/IsaacGymEnvs/isaacgymenvs/tasks/ant.py", line 97, in __init__
    zero_tensor = torch.tensor([0.0], device=self.device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It seems the GPU does not have enough memory for training. Which parameters should I tune to solve this issue?

toni.sm · October 31, 2021, 9:31am

Hi @jimingre

You can reduce the number of environments created with the --num_envs NUM_ENVS argument. This reduces the amount of memory allocated.

Example:

python train.py --task Anymal --num_envs 512

To make the change persistent, edit the numEnvs variable in the configuration files (.yaml) in the PATH_TO_ISAAC_GYM/python/rlgpu/cfg folder.

jimingre · October 31, 2021, 5:47pm

Hi @toni.sm

Thanks for your advice. I changed num_envs to 100, but training still stops abruptly.

task_name: Ant
experiment: 
num_envs: 100
seed: 42
torch_deterministic: False
max_iterations: 
physics_engine: physx
pipeline: gpu
sim_device: cuda:0
rl_device: cuda:0
graphics_device_id: 0
num_threads: 4
solver_type: 1
num_subscenes: 4
test: False
checkpoint: 
multi_gpu: False
headless: False
Setting seed: 42
Started to train
Exact experiment name requested from command line: Ant
/home/jrenaf/.local/lib/python3.6/site-packages/gym/spaces/box.py:74: UserWarning: WARN: Box bound precision lowered by casting to float32
  "Box bound precision lowered by casting to {}".format(self.dtype)
[Warning] [carb.gym.plugin] useGpu is set, forcing single scene (0 subscenes)
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
num envs 100 env spacing 5
Box([-1. -1. -1. -1. -1. -1. -1. -1.], [1. 1. 1. 1. 1. 1. 1. 1.], (8,), float32) Box([-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf], (60,), float32)
Env info:
{'action_space': Box([-1. -1. -1. -1. -1. -1. -1. -1.], [1. 1. 1. 1. 1. 1. 1. 1.], (8,), float32), 'observation_space': Box([-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf], (60,), float32)}
Error executing job with overrides: ['task=Ant', 'num_envs=100']
Traceback (most recent call last):
  File "train.py", line 112, in launch_rlg_hydra
    'play': cfg.test,
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 139, in run
    self.run_train()
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 122, in run_train
    agent = self.algo_factory.create(self.algo_name, base_name='run', config=self.config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/object_factory.py", line 15, in create
    return builder(**kwargs)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 23, in <lambda>
    self.algo_factory.register_builder('a2c_continuous', lambda **kwargs : a2c_continuous.A2CAgent(**kwargs))
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/algos_torch/a2c_continuous.py", line 18, in __init__
    a2c_common.ContinuousA2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 973, in __init__
    A2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 171, in __init__
    assert(self.batch_size % self.minibatch_size == 0)
AssertionError

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

However, if I set pipeline to CPU, it works fine.

toni.sm · October 31, 2021, 6:22pm

Hi @jimingre

You are getting this exception (in rl_games) because the batch_size is not a multiple of minibatch_size.

I checked the values for 100, 512 and 1024 (default) environments for the Ant task:

1024 environments (default) [yes]
    batch_size = 16384
    minibatch_size = 8192
512 environments (the minimum number of environments that satisfies the condition) [yes]
    batch_size = 8192
    minibatch_size = 8192
100 environments [no]
    batch_size = 1600
    minibatch_size = 8192
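
In each of the three cases listed here, batch_size is 16 times the number of environments. The check can be sketched as follows, assuming batch_size = num_envs * horizon_length with a horizon length of 16 (inferred from the numbers in this post, not read from the rl_games source). The function names are hypothetical and only illustrate the condition behind the assert.

```python
# Sketch of the divisibility condition behind the AssertionError.
# Assumption: batch_size = num_envs * horizon_length, with horizon_length = 16
# inferred from the listed values (16384 / 1024 = 16). Not taken from rl_games.

HORIZON_LENGTH = 16    # assumed value, see note above
MINIBATCH_SIZE = 8192  # value reported in this post for the Ant task


def batch_size(num_envs, horizon_length=HORIZON_LENGTH):
    """Samples collected per training iteration under the assumed relation."""
    return num_envs * horizon_length


def passes_assert(num_envs, minibatch_size=MINIBATCH_SIZE):
    """True when batch_size is a whole multiple of minibatch_size."""
    return batch_size(num_envs) % minibatch_size == 0


# Print num_envs, batch_size and whether the condition holds for each case.
for n in (1024, 512, 100):
    print(n, batch_size(n), passes_assert(n))
```

With 100 environments the batch (1600) is smaller than the minibatch (8192), so the remainder is non-zero and the assert fails.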

Solutions:

Try with 512 environments.
Otherwise, no idea (I have not explored rl_games in depth), so you could open a new issue in the rl_games repository 😅

Note: to reproduce your error I had to install rl_games from pip (pip install rl-games==1.0.2), because installing the latest version directly from GitHub gives some configuration-related errors before reaching the assert line 🤷‍♂️

jimingre · October 31, 2021, 6:41pm

Hi @toni.sm

I wonder how I can get access to the old version of IsaacGymEnvs?

Fix: For Ant, I changed the minibatch size to 8192 in the configuration and it now works fine. Anymal, however, still runs out of memory.
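
For other environment counts, a minibatch size that satisfies the rl_games assertion can be picked as a divisor of the batch size. This is a minimal sketch, assuming batch_size = num_envs * horizon_length with a horizon length of 16 (inferred from the values quoted earlier in this thread). The helper name is hypothetical and not part of rl_games.

```python
# Sketch: choose the largest minibatch size that divides the batch size.
# Assumption: batch_size = num_envs * horizon_length, horizon_length = 16
# (inferred from values quoted in this thread, not read from rl_games).


def largest_valid_minibatch(num_envs, max_size, horizon_length=16):
    """Return the largest divisor of the batch size that is <= max_size."""
    batch = num_envs * horizon_length
    # Walk downwards from the cap (or the batch size, if smaller).
    for candidate in range(min(batch, max_size), 0, -1):
        if batch % candidate == 0:
            return candidate
    raise ValueError("num_envs and max_size must be positive")


print(largest_valid_minibatch(512, 8192))  # 8192
print(largest_valid_minibatch(100, 8192))  # 1600
```

Whether a very small minibatch trains well is a separate question. The sketch only addresses the divisibility condition.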

toni.sm · October 31, 2021, 6:49pm

Hi @jimingre

Could you please try to install rl_games from pip using the following command?
```
pip install rl-games==1.0.2
# or
python -m pip install rl-games==1.0.2
```
You can check your installed rl_games version using
```
python -m pip show rl_games
```
This version worked for me with 512 environments.
Regarding the old version of IsaacGymEnvs, I don’t know how or if it is possible (I’m just a common user). The Isaac-Gym team could provide you with more information on this topic.

jimingre · October 31, 2021, 6:55pm

Hi @toni.sm

Thank you for that information.

My rl-games version is 1.1.3. Do you suggest that I downgrade to 1.0.2?

toni.sm · October 31, 2021, 6:58pm

Hi @jimingre

Yes. What is there to lose? You can always reinstall the latest version.
Well, in the end I don’t know anything about rl_games… 😅

jimingre · October 31, 2021, 7:02pm

Hi @toni.sm

There are some dependency issues. Maybe I will start a new thread on rl-games. Thanks for your help anyway!

user38580 · January 16, 2022, 5:34pm

Hello, nice to meet you.
Did you solve it? I have the same problem, and I am using an RTX 3060 6GB too.

jimingre · January 18, 2022, 4:23pm

Hi,

You might want to refer to this post, which solved my problem.

user38580 · January 20, 2022, 4:28am

Thanks a lot!!!

dvogureckiy99 · April 13, 2022, 9:36pm

I get this error: train.py: error: unrecognized arguments: --num_envs 512

toni.sm · April 13, 2022, 9:39pm

Hi @dvogureckiy99

--num_envs is for Isaac Gym preview 2.

In the Isaac Gym preview 3 environments, set num_envs=NUM_ENVS instead to select the number of environments, for example python train.py task=Ant num_envs=512.

dvogureckiy99 · April 13, 2022, 9:46pm

Thank you, but I still get errors. Can you help? See my post.

[Error] [carb.gym.plugin] Gym cuda error: no kernel image is available for execution on the device: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 991
[Error] [carb.gym.plugin] Gym cuda error: no kernel image is available for execution on the device: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 1010
[Error] [carb.gym.plugin] Gym cuda error: no kernel image is available for execution on the device: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 926
[Error] [carb.gym.plugin] Failed to fill rigid body state tensor
[Error] [carb.gym.plugin] Gym cuda error: no kernel image is available for execution on the device: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 991
[Error] [carb.gym.plugin] Gym cuda error: no kernel image is available for execution on the device: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 1010
fps step: 18839.4 fps step and policy inference: 17392.6  fps total: 15201.0
=> saving checkpoint 'runs/Cartpole/nn/last_Cartpoleep101rew[497.33].pth'
MAX EPOCHS NUM!

toni.sm · April 13, 2022, 9:56pm

Hi @dvogureckiy99

I am just a user and I don’t know anything about the programming/implementation of Isaac Gym :(

Maybe these posts, in which similar problems are reported, can be of help to you:

Gym cuda error: no kernel image is available for execution on the device
Gym cuda error: no kernel image is available for execution on the device, pt. 2
Error on running 1080_balls_of_solitude and other simulations examples when using GPU pipeline

hamid.ustc · January 10, 2024, 2:27am

Hi everyone! I get the error below when I try to run legged_gym/scripts/train.py --task=anymal_c_flat. How can I resolve it?

[Error] [carb.gym.plugin] Gym cuda error: out of memory: …/…/…/source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 1721