i would like to play checkpoints in one terminal while running training in another. when i first tried this i got CUDA error: out of memory
, so i tried “playing” the model without using the gpu:
python train.py task=Ant test=True checkpoint=cp.pth num_envs=4 sim_device=cpu rl_device=cpu pipeline=cpu
sadly this still sometimes causes the CUDA memory error.
File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 29, in <lambda>
self.player_factory.register_builder('a2c_continuous', lambda **kwargs : players.PpoPlayerContinuous(**kwargs))
File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/players.py", line 28, in __init__
self.actions_low = torch.from_numpy(self.action_space.low.copy()).float().to(self.device)
File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/torch/cuda/__init__.py", line 170, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA error: out of memory
stuart@hp ~/r/I/isaacgymenvs (main) [1]> python train.py task=Ant test=True checkpoint=cp.pth num_envs=4 sim_device=cpu rl_device=cpu pipeline=cpu
is there any way to play
the model without touching the GPU at all?
incidentally is there any way to reset the CUDA memory without a full reboot? i tried sudo rmmod nvidia_uvm ; sudo modprobe nvidia_uvm
from Reset GPU without restarting linux? - #6 by Harsha - Part 2 & Alumni (2018) - Deep Learning Course Forums, but got some errors that nvidia_uvm was in use and couldn’t figure a way around it
thank you for reading!