Play a checkpoint file without using the GPU at all, to avoid memory errors?

i would like to play checkpoints in one terminal while running training in another. when i first tried this i got “CUDA error: out of memory”, so i tried “playing” the model without using the GPU:

python train.py task=Ant test=True checkpoint=cp.pth num_envs=4 sim_device=cpu rl_device=cpu pipeline=cpu

sadly this still sometimes causes the CUDA memory error.

File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 29, in <lambda>
    self.player_factory.register_builder('a2c_continuous', lambda **kwargs : players.PpoPlayerContinuous(**kwargs))
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/players.py", line 28, in __init__
    self.actions_low = torch.from_numpy(self.action_space.low.copy()).float().to(self.device)
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/torch/cuda/__init__.py", line 170, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA error: out of memory

is there any way to play the model without touching the GPU at all?

incidentally, is there any way to reset the CUDA memory without a full reboot? i tried sudo rmmod nvidia_uvm ; sudo modprobe nvidia_uvm (from post #6 by Harsha in the thread “Reset GPU without restarting linux?” on the Deep Learning Course Forums), but got errors saying that nvidia_uvm was in use, and i couldn’t figure out a way around it.

thank you for reading!

never mind. i was running out of memory because i was stopping training with ctrl-z instead of ctrl-c. ctrl-z only suspends the process, so python stayed alive in the background and kept holding its GPU memory. i had misread some docs somewhere, thinking they suggested ctrl-z. solved by running nvidia-smi and spotting the lingering python processes there.
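
for anyone hitting the same thing, here is a minimal Python sketch of that nvidia-smi check. it assumes the installed nvidia-smi supports the --query-compute-apps and --format=csv options; the function names are made up for illustration and are not part of isaacgymenvs or rl_games.

```python
import csv
import io
import subprocess

# Assumption: the installed nvidia-smi accepts these query options.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Parse nvidia-smi CSV output into (pid, name, used_memory) tuples.

    used_memory is kept as a string because nvidia-smi may report a
    placeholder instead of a number on some setups.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, name, used = (f.strip() for f in fields)
        rows.append((int(pid), name, used))
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the processes currently holding GPU memory."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

calling list_gpu_processes() on the training machine returns one tuple per process that still holds GPU memory. a suspended python left over from ctrl-z should show up there, and can then be resumed with fg and stopped with ctrl-c, or killed by pid.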

i’m still a little curious as to why CUDA gets initialized even when everything is set to use the CPU, but it doesn’t really matter for me right now.
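
the traceback suggests that the player's self.device still resolved to a CUDA device despite the cpu overrides, though the thread does not confirm why. one general way to keep a process from seeing any GPU is to launch it with CUDA_VISIBLE_DEVICES set to an empty string. the sketch below does that for the command from this thread. it is untested against Isaac Gym, which may refuse to start without a visible GPU, and the helper names are hypothetical.

```python
import os
import subprocess
import sys


def cpu_only_env(base=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base is None else base)
    # An empty value makes the CUDA runtime report zero available devices.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def play_on_cpu(args):
    """Run train.py in a child process that cannot see any GPU.

    Assumption: train.py is in the current working directory.
    """
    cmd = [sys.executable, "train.py"] + list(args)
    return subprocess.run(cmd, env=cpu_only_env()).returncode
```

usage would look like play_on_cpu(["task=Ant", "test=True", "checkpoint=cp.pth", "num_envs=4", "sim_device=cpu", "rl_device=cpu", "pipeline=cpu"]). with the devices hidden, any code path that still asks for a CUDA device should fail with a "no CUDA devices" style error instead of an out of memory error, which also helps locate where the GPU is being requested.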
