I’ve been training PointPillars models using a custom dataset that contains self-collected point cloud and annotation data plus the existing KITTI dataset. Currently my training dataset contains over 30K samples.
I ran pointpillars.ipynb in Firefox, and all the samples mentioned above had to be processed by running the following snippet:
!tao model pointpillars dataset_convert -e $SPECS_DIR/pointpillars.yaml
The problem is that I intended to add more samples and then run dataset_convert again so I could fine-tune my model, but I hit Firefox's "Your Tab Just Crashed" error several times.
I then reduced the number of samples back to 30K and things went well again.
I traced system memory usage with the htop command and noticed that the tab crashed when memory was completely used up (my training machine has 64 GB of RAM). Memory usage kept rising while dataset_convert was running, until the tab crashed.
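In case it helps to reproduce this, a rough way to log the memory growth over time while dataset_convert is running (the log file name is just an example) is a small loop like:

$ while true; do { date '+%F %T'; free -h; } >> mem_usage.log; sleep 10; done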
If that is indeed what made the tab crash, it means I can’t add any more samples due to the limited memory capacity.
Is there any workaround or way to avoid full memory usage?
Hi, sorry for the late reply.
I just tried increasing the swap size, which was 2 GB by default, to 100 GB, and ran dataset_convert again.
At first it did process more samples than before the swap was increased, but the system still seemed to use physical memory (Mem in htop) first, and once that was fully used the Firefox tab still crashed even though plenty of swap space was still available; only 11 GB of it was used during the process.
I then ran the swapoff and swapon commands once more, but this time I set the swap priority to a positive number to see whether swap would be used first. As a result, the swap wasn’t used at all…
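For completeness, the swap changes were roughly along these lines (the swap file path, size, and priority value here are only illustrative):

$ sudo swapoff -a
$ sudo fallocate -l 100G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon -p 10 /swapfile    # -p sets the swap priority
$ swapon --show                  # verify size and priority

The -p flag sets a positive priority for the swap file, and swapon --show confirms the size and priority that are actually in effect.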
UPDATED 2024/1/10 16:35
The system started using swap after roughly 22 of the 62.5 GB of memory were occupied, but still only very little: just 8.25 MB of swap was in use when nearly 37 GB of memory had been taken by dataset_convert.
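From what I understand, the priority set with swapon -p only decides the order among multiple swap devices; how willing the kernel is to swap at all is controlled by vm.swappiness, so something like the following might push more pages to swap (the value 90 is illustrative):

$ cat /proc/sys/vm/swappiness      # default is usually 60
$ sudo sysctl vm.swappiness=90     # make the kernel more willing to swap

Note that the sysctl change above only lasts until reboot.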
Instead of running dataset_convert via Firefox, I ran it again directly from the terminal without changing anything else, and memory was not fully occupied as before. Memory consumption still increased over time, but not as fast as it used to, and I successfully converted 45K point cloud samples.
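For anyone trying the same thing, one way to keep the conversion independent of the browser or terminal session is to run it detached and write its output to a log file (the log file name is arbitrary), roughly:

$ nohup tao model pointpillars dataset_convert -e $SPECS_DIR/pointpillars.yaml > dataset_convert.log 2>&1 &
$ tail -f dataset_convert.log    # follow progress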
The command I’ve been talking about in recent days in pointpillars.ipynb is:
!tao model pointpillars dataset_convert -e $SPECS_DIR/pointpillars.yaml
where SPECS_DIR represents the path /workspace/tao-experiments/pointpillars/specs.
I literally ran the same command from the terminal in the tao conda environment. With the same amount of data to process and the same content in the .yaml file, the same command resulted in different memory consumption just because it was run differently: once from the .ipynb in Firefox and once directly from the terminal.
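One detail for anyone reproducing the terminal run: in the notebook, SPECS_DIR comes from the notebook environment, so in a fresh shell it has to be set first, roughly:

$ export SPECS_DIR=/workspace/tao-experiments/pointpillars/specs
$ tao model pointpillars dataset_convert -e $SPECS_DIR/pointpillars.yaml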
As for the command you mentioned, I tried it but got this error message:
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
Could you open a new terminal and retry?
BTW, I corrected the command as below.
$ docker run --runtime=nvidia -it --rm --shm-size 32G nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt2.1.0 /bin/bash
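If the "unknown or invalid runtime name: nvidia" error shows up again, the NVIDIA runtime is probably not registered with Docker. Assuming the nvidia-container-toolkit package is already installed, the usual fix is along these lines:

$ sudo nvidia-ctk runtime configure --runtime=docker    # register the nvidia runtime
$ sudo systemctl restart docker
$ docker info | grep -i runtimes

The last command should list nvidia among the available runtimes.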