I’ve been training PointPillars models using a custom dataset that contains self-collected point cloud and annotation data plus the existing KITTI dataset. Currently my training dataset contains over 30K samples.
I ran pointpillars.ipynb in Firefox, and all the samples mentioned above had to be processed by running the following snippet:
!tao model pointpillars dataset_convert -e $SPECS_DIR/pointpillars.yaml
The problem is that I intended to add more samples and then run dataset_convert again so I could fine-tune my model, but I hit Firefox's "Your Tab Just Crashed" error several times.
I then reduced the number of samples back to 30K and things went well again.
I traced system memory usage with the htop command and noticed that the tab crashed when memory was completely used up (my training machine has 64 GB of RAM). Memory usage kept rising while dataset_convert was running, until the tab crashed.
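In case it helps to reproduce this, a rough way to log the memory growth over time while dataset_convert is running (the log file name is just an example) is a small loop like:

$ while true; do { date '+%F %T'; free -h; } >> mem_usage.log; sleep 10; done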
If that is indeed what made the tab crash, it means I can’t add any more samples due to the limited memory capacity.
Is there any workaround or way to avoid full memory usage?
Hi, sorry for the late reply.
I just tried increasing the swap size, which was 2 GB by default, to 100 GB, and ran dataset_convert again.
At first it did process more samples than before the swap was increased, but the system still seemed to use physical memory (Mem in htop) first, and once that was fully used the Firefox tab still crashed even though plenty of swap space was still available; only 11 GB of it was used during the process.
I then ran the swapoff and swapon commands once more, but this time I set the swap priority to a positive number to see whether swap would be used first. As a result, the swap wasn’t used at all…
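For completeness, the swap changes were roughly along these lines (the swap file path, size, and priority value here are only illustrative):

$ sudo swapoff -a
$ sudo fallocate -l 100G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon -p 10 /swapfile    # -p sets the swap priority
$ swapon --show                  # verify size and priority

The -p flag sets a positive priority for the swap file, and swapon --show confirms the size and priority that are actually in effect.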
UPDATED 2024/1/10 16:35
The system started using swap after roughly 22 of the 62.5 GB of memory were occupied, but still only very little: just 8.25 MB of swap was in use when nearly 37 GB of memory had been taken by dataset_convert.
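From what I understand, the priority set with swapon -p only decides the order among multiple swap devices; how willing the kernel is to swap at all is controlled by vm.swappiness, so something like the following might push more pages to swap (the value 90 is illustrative):

$ cat /proc/sys/vm/swappiness      # default is usually 60
$ sudo sysctl vm.swappiness=90     # make the kernel more willing to swap

Note that the sysctl change above only lasts until reboot.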
Instead of running dataset_convert via Firefox, I ran it again directly from the terminal without changing anything else, and memory was not fully occupied as before. Memory consumption still increased over time, but not as fast as it used to, and I successfully converted 45K point cloud samples.
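For anyone trying the same thing, one way to keep the conversion independent of the browser or terminal session is to run it detached and write its output to a log file (the log file name is arbitrary), roughly:

$ nohup tao model pointpillars dataset_convert -e $SPECS_DIR/pointpillars.yaml > dataset_convert.log 2>&1 &
$ tail -f dataset_convert.log    # follow progress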
The command I’ve been talking about in recent days in pointpillars.ipynb is:
!tao model pointpillars dataset_convert -e $SPECS_DIR/pointpillars.yaml
where SPECS_DIR represents the path /workspace/tao-experiments/pointpillars/specs.
I literally ran the same command from the terminal in the tao conda environment. With the same amount of data to process and the same content in the .yaml file, the same command resulted in different memory consumption just because it was run differently: once from the .ipynb in Firefox and once directly from the terminal.
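One detail for anyone reproducing the terminal run: in the notebook, SPECS_DIR comes from the notebook environment, so in a fresh shell it has to be set first, roughly:

$ export SPECS_DIR=/workspace/tao-experiments/pointpillars/specs
$ tao model pointpillars dataset_convert -e $SPECS_DIR/pointpillars.yaml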
As for the command you mentioned, I tried it but got this error message:
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
Could you open a new terminal and retry?
BTW, I corrected the command as below.
$ docker run --runtime=nvidia -it --rm --shm-size 32G nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt2.1.0 /bin/bash
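If the "unknown or invalid runtime name: nvidia" error shows up again, the NVIDIA runtime is probably not registered with Docker. Assuming the nvidia-container-toolkit package is already installed, the usual fix is along these lines:

$ sudo nvidia-ctk runtime configure --runtime=docker    # register the nvidia runtime
$ sudo systemctl restart docker
$ docker info | grep -i runtimes

The last command should list nvidia among the available runtimes.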