I would like to play checkpoints in one terminal while running training in another. When I first tried this I got
`CUDA error: out of memory`, so I tried "playing" the model without using the GPU:
python train.py task=Ant test=True checkpoint=cp.pth num_envs=4 sim_device=cpu rl_device=cpu pipeline=cpu
Sadly this still sometimes raises the CUDA out-of-memory error:
File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 29, in <lambda>
    self.player_factory.register_builder('a2c_continuous', lambda **kwargs : players.PpoPlayerContinuous(**kwargs))
File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/players.py", line 28, in __init__
    self.actions_low = torch.from_numpy(self.action_space.low.copy()).float().to(self.device)
File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/torch/cuda/__init__.py", line 170, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA error: out of memory
Is there any way to play the model without touching the GPU at all?
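Judging from the traceback, the player still ends up calling `.to(self.device)` with a CUDA device, which triggers `torch._C._cuda_init()` regardless of the `sim_device`/`rl_device`/`pipeline` flags. One workaround I'm considering (not verified against rl_games, just standard CUDA behavior) is hiding the GPUs from the process entirely, so torch can never initialize CUDA:

```shell
# Hide all CUDA devices from this process; torch then cannot call
# _cuda_init() and everything must fall back to CPU. If rl_games has
# a hard-coded 'cuda' device string somewhere, this may just turn the
# OOM into a "no CUDA GPUs are available" error instead, but it does
# guarantee the player can't allocate GPU memory.
CUDA_VISIBLE_DEVICES="" python train.py task=Ant test=True checkpoint=cp.pth \
    num_envs=4 sim_device=cpu rl_device=cpu pipeline=cpu
```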
Incidentally, is there any way to free the CUDA memory without a full reboot? I tried
`sudo rmmod nvidia_uvm ; sudo modprobe nvidia_uvm` from "Reset GPU without restarting linux? - #6 by Harsha - Part 2 & Alumni (2018) - Deep Learning Course Forums", but got errors that nvidia_uvm was in use and couldn't figure a way around it.
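For reference, here's the sequence I'd try next, assuming the "in use" error means some process still has the NVIDIA device nodes open (I haven't confirmed this frees the module on my setup; `nvidia-persistenced` is a guess at the culprit):

```shell
# List processes holding the NVIDIA device nodes open; Xorg, the
# persistence daemon, and stray python processes are common culprits.
sudo fuser -v /dev/nvidia*

# Stop whatever fuser reported (example: the persistence daemon),
# then retry the module reload.
sudo systemctl stop nvidia-persistenced
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
```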
Thank you for reading!