Play a checkpoint file without using the GPU at all, to avoid memory errors?

I would like to play checkpoints in one terminal while training runs in another. When I first tried this I got "CUDA error: out of memory", so I tried "playing" the model without using the GPU:

python task=Ant test=True checkpoint=cp.pth num_envs=4 sim_device=cpu rl_device=cpu pipeline=cpu

Sadly, this still sometimes fails with the same CUDA out-of-memory error:

File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/", line 29, in <lambda>
    self.player_factory.register_builder('a2c_continuous', lambda **kwargs : players.PpoPlayerContinuous(**kwargs))
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/", line 28, in __init__
    self.actions_low = torch.from_numpy(self.action_space.low.copy()).float().to(self.device)
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/torch/cuda/", line 170, in _lazy_init
RuntimeError: CUDA error: out of memory

Is there any way to play the model without touching the GPU at all?
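One blunt way to guarantee a process never touches the GPU is to hide all CUDA devices from it with CUDA_VISIBLE_DEVICES: the CUDA runtime then reports zero devices, so any code that still tries to use cuda:0 fails immediately instead of allocating memory. A sketch, where "train.py" is only a placeholder for whatever entry script is actually being launched:

```shell
# Hide every CUDA device from this one process: the CUDA runtime sees
# no GPUs, so nothing can allocate GPU memory by accident.
# "train.py" is a placeholder for the actual entry script.
CUDA_VISIBLE_DEVICES="" python train.py task=Ant test=True checkpoint=cp.pth \
    num_envs=4 sim_device=cpu rl_device=cpu pipeline=cpu
```

This is coarser than the sim_device/rl_device/pipeline flags, but it turns any stray GPU use into an immediate, obvious error.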

Incidentally, is there any way to reset the CUDA memory without a full reboot? I tried sudo rmmod nvidia_uvm ; sudo modprobe nvidia_uvm from "Reset GPU without restarting linux? - #6 by Harsha - Part 2 & Alumni (2018) - Deep Learning Course Forums", but got errors saying nvidia_uvm was in use, and I couldn't figure out a way around that.
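For what it's worth, a "module in use" error from rmmod usually means some process still has the NVIDIA device files open. A hedged sketch of how one might track those processes down, assuming the usual /dev/nvidia* device nodes and the fuser tool from psmisc:

```shell
# Show which PIDs still hold the NVIDIA device files open;
# these are what keep nvidia_uvm "in use".
sudo fuser -v /dev/nvidia*

# Once those processes have been killed, reloading should succeed:
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
```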

thank you for reading!

Never mind, solved. I was running out of memory because I was stopping training with Ctrl-Z instead of Ctrl-C, which only suspends the process rather than terminating it, so Python kept running in the background and held on to its GPU memory. I had misread some docs somewhere as suggesting Ctrl-Z. Running nvidia-smi showed the lingering python processes.
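For anyone else who hits this: Ctrl-C sends SIGINT, whose default action is to terminate the foreground process, while Ctrl-Z sends SIGTSTP, which merely suspends it, so the process keeps all of its CUDA allocations until it is resumed or killed. The two signals can be inspected from the Python standard library:

```python
import signal

# Ctrl-C -> SIGINT  (default action: terminate the process)
# Ctrl-Z -> SIGTSTP (default action: stop/suspend; memory stays allocated)
print(signal.SIGINT.name, signal.SIGTSTP.name)
```

A suspended job can still be cleaned up afterwards with fg followed by Ctrl-C, or with kill on the PID that nvidia-smi reports.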

I'm still a little curious why CUDA gets initialized even when everything is set to use the CPU, but it doesn't really matter for me right now.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.