Out of memory during training

I am following NVIDIA's "Hello AI World" tutorial on my new Jetson Nano dev kit (4GB). In the 3rd video (here), a cat/dog model is trained on top of an existing network. The command is:
python3 train.py --model-dir=models/cat_dog data/cat_dog

and it aborts with a "Killed" message.
When I add the flags that are supposed to reduce memory usage, "--batch-size=4 --workers=1 --epochs=1", it starts running but then aborts with "OSError: [Errno 12] Cannot allocate memory".
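For reference, the full command with those flags (using the same model and data paths as above) looks like this:

python3 train.py --model-dir=models/cat_dog --batch-size=4 --workers=1 --epochs=1 data/cat_dog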

Yet in the video it runs fine, even though it is done on the 2GB model, while I am using the 4GB model.
In my case, I also terminated all other applications.
Any idea why it can't finish the task?
Can I execute the training outside the Jetson Nano?

Hi,

Could you try setting --batch-size 2 --workers 1 to decrease the memory usage?
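For example, with the model and data paths from the tutorial, the command would look something like:

python3 train.py --model-dir=models/cat_dog --batch-size 2 --workers 1 data/cat_dog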
Thanks.


Also, please refer to the following posts:


Thanks. I restarted the system and this time the run passed, so I assume my memory usage is right on the edge. I will try that later.

Thanks. Somehow the run passed after a restart, so I am making progress now. I will try that next time.

Hi byigal,

Was the issue fixed after restarting?
Thanks.

I restarted a few times. At first it wasn't solved. Then it went well after I made sure nothing else was running. This made me think that the required memory is right on the edge.

Hi,

You can monitor the memory status with tegrastats.
If the usage is close to the maximum, maybe you can even lower the batch size to 1.
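For example, you could run it in a second terminal while the training is going (the exact output fields vary by JetPack version, but the RAM entry reports used/total memory in MB):

tegrastats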

Thanks.

