Gym CUDA error: running out of memory

Hi @toni.sm

Thanks for your advice. I changed num_envs to 100, but training still stops abruptly.

task_name: Ant
experiment: 
num_envs: 100
seed: 42
torch_deterministic: False
max_iterations: 
physics_engine: physx
pipeline: gpu
sim_device: cuda:0
rl_device: cuda:0
graphics_device_id: 0
num_threads: 4
solver_type: 1
num_subscenes: 4
test: False
checkpoint: 
multi_gpu: False
headless: False
Setting seed: 42
Started to train
Exact experiment name requested from command line: Ant
/home/jrenaf/.local/lib/python3.6/site-packages/gym/spaces/box.py:74: UserWarning: WARN: Box bound precision lowered by casting to float32
  "Box bound precision lowered by casting to {}".format(self.dtype)
[Warning] [carb.gym.plugin] useGpu is set, forcing single scene (0 subscenes)
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
num envs 100 env spacing 5
Box([-1. -1. -1. -1. -1. -1. -1. -1.], [1. 1. 1. 1. 1. 1. 1. 1.], (8,), float32) Box([-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf], (60,), float32)
Env info:
{'action_space': Box([-1. -1. -1. -1. -1. -1. -1. -1.], [1. 1. 1. 1. 1. 1. 1. 1.], (8,), float32), 'observation_space': Box([-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf], (60,), float32)}
Error executing job with overrides: ['task=Ant', 'num_envs=100']
Traceback (most recent call last):
  File "train.py", line 112, in launch_rlg_hydra
    'play': cfg.test,
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 139, in run
    self.run_train()
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 122, in run_train
    agent = self.algo_factory.create(self.algo_name, base_name='run', config=self.config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/object_factory.py", line 15, in create
    return builder(**kwargs)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/torch_runner.py", line 23, in <lambda>
    self.algo_factory.register_builder('a2c_continuous', lambda **kwargs : a2c_continuous.A2CAgent(**kwargs))
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/algos_torch/a2c_continuous.py", line 18, in __init__
    a2c_common.ContinuousA2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 973, in __init__
    A2CBase.__init__(self, base_name, config)
  File "/home/jrenaf/.local/lib/python3.6/site-packages/rl_games/common/a2c_common.py", line 171, in __init__
    assert(self.batch_size % self.minibatch_size == 0)
AssertionError

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
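If I am reading the failing assertion in rl_games/common/a2c_common.py correctly, batch_size is horizon_length * num_envs, and it has to be evenly divisible by the minibatch_size from the training config. A quick sanity check of that arithmetic (the horizon_length and minibatch_size values below are just the stock AntPPO.yaml defaults, not necessarily what my run actually uses):

# rough check of the assertion that fails in the traceback above
# horizon_length and minibatch_size are assumed AntPPO.yaml defaults
horizon_length = 16
minibatch_size = 32768

batch_size = horizon_length * 100        # num_envs=100 -> 1600
print(batch_size % minibatch_size == 0)  # False -> AssertionError

batch_size = horizon_length * 4096       # default num_envs -> 65536
print(batch_size % minibatch_size == 0)  # True

So it looks like lowering num_envs without also lowering minibatch_size (or horizon_length) breaks that check.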

However, if I set the pipeline to CPU, it works fine.
