Tao model error

iamliuxin2020 · October 16, 2024, 4:42pm

Hi there, I am a beginner with TAO Toolkit. I am following the instructions in the OCDNet notebook and have completed these steps: 1. set up the environment variables and mapped the drives, 2. installed the TAO launcher, 3. set up the training spec (actually, I use it as provided).

Please provide the following information when requesting support.

• Hardware: RTX 3060ti
• Network Type : ocdnet_vtrainable_resnet18_v1.0
• TLT Version Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024

• Training spec file(If have, please share here)
model:
  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  pretrained_model_path: '/data/ocdnet/ocdnet_deformable_resnet18.pth'
  backbone: deformable_resnet18

train:
  results_dir: /results/train
  num_gpus: 1
  num_epochs: 30
  #resume_training_checkpoint_path: '/results/train/resume.pth'
  checkpoint_interval: 1
  validation_interval: 1
  trainer:
    clip_grad_norm: 5.0

  optimizer:
    type: Adam
    args:
      lr: 0.001

  lr_scheduler:
    type: WarmupPolyLR
    args:
      warmup_epoch: 3

  post_processing:
    type: SegDetectorRepresenter
    args:
      thresh: 0.3
      box_thresh: 0.55
      max_candidates: 1000
      unclip_ratio: 1.5

  metric:
    type: QuadMetric
    args:
      is_output_polygon: false

dataset:
  train_dataset:
    data_path: ['/data/ocdnet/train']
    args:
      pre_processes:
        - type: IaaAugment
          args:
            - {'type':Fliplr, 'args':{'p':0.5}}
            - {'type': Affine, 'args':{'rotate':[-10,10]}}
            - {'type':Resize,'args':{'size':[0.5,3]}}
        - type: EastRandomCropData
          args:
            size: [640,640]
            max_tries: 50
            keep_ratio: true
        - type: MakeBorderMap
          args:
            shrink_ratio: 0.4
            thresh_min: 0.3
            thresh_max: 0.7
        - type: MakeShrinkMap
          args:
            shrink_ratio: 0.4
            min_text_size: 8

      img_mode: BGR
      filter_keys: [img_path,img_name,text_polys,texts,ignore_tags,shape]
      ignore_tags: ['*', '###']
    loader:
      batch_size: 20
      pin_memory: true
      num_workers: 12

  validate_dataset:
    data_path: ['/data/ocdnet/test']
    args:
      pre_processes:
        - type: Resize2D
          args:
            short_size:
              - 1280
              - 736
            resize_text_polys: true
      img_mode: BGR
      filter_keys:
      ignore_tags: ['*', '###']
    loader:
      batch_size: 1
      pin_memory: false
      num_workers: 1

The mount.json is as follows:

{
    "Mounts": [
        {
            "source": "/home/cc/Documents/tao-ocd-dir",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/cc/Documents/tao-ocd-dir/data/ocdnet",
            "destination": "/data/ocdnet"
        },
        {
            "source": "/home/cc/Documents/tao_tutorials/notebooks/tao_launcher_starter_kit/ocdnet/specs",
            "destination": "/specs"
        },
        {
            "source": "/home/cc/Documents/tao-ocd-dir/ocdnet/results",
            "destination": "/results"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "network": "host"
    }
}
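
A quick sanity check on a mounts file like the one above can rule out path mistakes before looking at Docker itself. The following is a minimal sketch, not part of the TAO launcher: `check_mounts` is a hypothetical helper, and the rules it applies (source directory must exist on the host, destination must be absolute and unique) are assumptions about what a usable bind mount needs.

```python
import os
from typing import Dict, List


def check_mounts(config: Dict) -> List[str]:
    """Return human-readable problems found in a launcher mounts config."""
    problems = []
    seen_destinations = set()
    for mount in config.get("Mounts", []):
        source = mount.get("source", "")
        destination = mount.get("destination", "")
        # A bind mount needs an existing directory on the host side.
        if not os.path.isdir(source):
            problems.append(f"source directory does not exist: {source}")
        # Container paths are expected to be absolute.
        if not destination.startswith("/"):
            problems.append(f"destination is not an absolute path: {destination}")
        if destination in seen_destinations:
            problems.append(f"destination is mapped more than once: {destination}")
        seen_destinations.add(destination)
    return problems


# Example with the first mount from the post. On any other machine the
# source directory will most likely be reported as missing.
example = {
    "Mounts": [
        {
            "source": "/home/cc/Documents/tao-ocd-dir",
            "destination": "/workspace/tao-experiments",
        }
    ]
}
for problem in check_mounts(example):
    print("WARNING:", problem)
```

To check a real file, load it with `json.load` and pass the resulting dictionary to `check_mounts`; an empty list means none of the assumed rules were violated.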

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

The first time I ran the cell below, I saw that TAO was pulling a lot of containers, but then my notebook crashed, so I am not sure whether it pulled everything it needed.

!tao model ocdnet train \
    -e $SPECS_DIR/train.yaml \
    results_dir=$RESULTS_DIR \
    model.pretrained_model_path=$RESULTS_DIR/pretrained_ocdnet/ocdnet_vtrainable_resnet18_v1.0/ocdnet_deformable_resnet18.pth
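
The paths in this command only make sense if they exist inside the container, so it can help to translate a host path through the mount table first. This is a minimal sketch under two assumptions: that arguments are passed to the container unchanged, and that the most specific (longest) matching source is the mapping of interest. `to_container_path` is a hypothetical helper, not a TAO function.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional


def to_container_path(host_path: str, mounts: List[Dict[str, str]]) -> Optional[str]:
    """Translate a host path to a container path, or None if it is not mounted."""
    path = PurePosixPath(host_path)
    best = None  # (source, destination) pair with the longest matching source
    for mount in mounts:
        source = PurePosixPath(mount["source"])
        if path == source or source in path.parents:
            if best is None or len(source.parts) > len(best[0].parts):
                best = (source, PurePosixPath(mount["destination"]))
    if best is None:
        return None
    return str(best[1] / path.relative_to(best[0]))


# Two of the mounts from the post.
mounts = [
    {"source": "/home/cc/Documents/tao-ocd-dir",
     "destination": "/workspace/tao-experiments"},
    {"source": "/home/cc/Documents/tao-ocd-dir/data/ocdnet",
     "destination": "/data/ocdnet"},
]
print(to_container_path("/home/cc/Documents/tao-ocd-dir/data/ocdnet/train", mounts))
print(to_container_path("/tmp/not-mounted", mounts))
```

A result of `None` means the host path is not covered by any mount, so a file there would not be visible to the training process.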

But when I re-entered the notebook and ran this command again, it no longer downloaded anything and only printed the log and error below:

2024-10-17 00:33:21,551 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-17 00:33:21,593 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-10-17 00:33:21,601 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
Error response from daemon: No such container: bddc7ffc435e4e9cea1521c61dd1aeb52c8151e0795e5d7c8c3c65d02460b150
2024-10-17 00:33:22,387 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

What should I do?

Morganh · October 17, 2024, 3:32am

You can open a terminal to check whether the Docker image was pulled successfully.
You can run:
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash
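
If an interactive shell is not convenient (for example from inside the notebook), the presence of the image can also be checked without starting a container. This is a minimal sketch that relies on `docker image inspect` returning a non-zero exit code when the image is not available locally; the helper names and the injectable `runner` parameter are illustrative choices, not part of any TAO or Docker API.

```python
import subprocess
from typing import Callable, List

IMAGE = "nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt"


def run_quietly(argv: List[str]) -> int:
    """Run a command, discard its output, and return its exit code."""
    try:
        completed = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The docker executable itself is missing from PATH.
        return 127
    return completed.returncode


def image_is_local(image: str, runner: Callable[[List[str]], int] = run_quietly) -> bool:
    """Return True if `docker image inspect` succeeds for the given image."""
    return runner(["docker", "image", "inspect", image]) == 0
```

Calling `image_is_local(IMAGE)` on the host answers only whether the image is present locally; it says nothing about GPU access, which the `docker run --runtime=nvidia` command exercises as well.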

iamliuxin2020 · October 17, 2024, 11:08am

Then I got this:

===========================
=== TAO Toolkit PyTorch ===

NVIDIA Release 5.5.0-PyT (build 88113656)
TAO Toolkit Version 5.5.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TAO Toolkit. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 …

root@29998351c88a:/opt/nvidia/tools#

I am guessing this is because I did not install nvidia-docker2? I remember that the TAO launcher installation showed a prompt about it, but I ignored it.
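
Whether Docker already knows a runtime named `nvidia` can be checked directly instead of guessed. The sketch below assumes that `docker info --format '{{json .Runtimes}}'` prints a JSON object keyed by runtime name; the helper names are hypothetical, and the parsing function is kept separate so it can be tried on captured output.

```python
import json
import subprocess


def docker_info_runtimes() -> str:
    """Return the JSON text that `docker info` prints for its runtimes."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def has_nvidia_runtime(runtimes_json: str) -> bool:
    """Return True if the runtimes JSON object has an "nvidia" entry."""
    try:
        runtimes = json.loads(runtimes_json)
    except json.JSONDecodeError:
        # Empty or malformed output is treated as "not registered".
        return False
    return isinstance(runtimes, dict) and "nvidia" in runtimes
```

On the host, `has_nvidia_runtime(docker_info_runtimes())` returning `True` would be consistent with the `docker run --runtime=nvidia` command above starting successfully.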

iamliuxin2020 · October 17, 2024, 11:21am

Sorry, I just tried that. I installed nvidia-docker2, restarted the notebook, and ran the tao train command again. It still failed.

2024-10-17 19:19:55,629 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-17 19:19:55,676 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-10-17 19:19:55,685 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
Error response from daemon: No such container: 64e3e92d0c70c6146eededa5458f6aa0bf5e6466134407c01d114e36c4ef5a49
2024-10-17 19:19:56,395 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Then I ran some checks on my Docker setup:

(launcher) cc@CC-desktop-7921:~$ docker ps -a
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
(launcher) cc@CC-desktop-7921:~$ docker images
REPOSITORY                       TAG         IMAGE ID       CREATED         SIZE
nvcr.io/nvidia/tao/tao-toolkit   5.5.0-pyt   98766d6ac7d2   8 weeks ago     25.9GB
hello-world                      latest      d2c94e258dcb   17 months ago   13.3kB
(launcher) cc@CC-desktop-7921:~$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/get-started/

(launcher) cc@CC-desktop-7921:~$

It seems that Docker itself works correctly, and so do the rootless permissions.
The image was pulled, so I don't know why I cannot train the model.

Here is the container list from the TAO launcher:

(launcher) cc@CC-desktop-7921:~$ tao list
============== ================== =========
container_id container_status command
============== ================== =========
============== ================== =========
It seems that there are some containers, but their IDs are not listed.
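
Both `docker ps -a` and `tao list` come back empty in the output above, which means Docker holds no container, running or stopped, at the moment of the check. To see whether a container appears at all while the launcher command is running, the JSON form of `docker ps` can be filtered by image. This is a minimal sketch: it assumes the one-JSON-object-per-line output of `docker ps -a --format '{{json .}}'` with `ID`, `Image` and `Status` keys, and `containers_for_image` is a hypothetical helper.

```python
import json
from typing import Dict, List


def containers_for_image(ps_output: str, image: str) -> List[Dict[str, str]]:
    """Parse `docker ps -a --format '{{json .}}'` output and keep rows for one image."""
    rows = []
    for line in ps_output.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)  # one JSON object per container
        if row.get("Image") == image:
            rows.append({"id": row.get("ID", ""), "status": row.get("Status", "")})
    return rows
```

The output to feed in can be captured with `subprocess.run(["docker", "ps", "-a", "--format", "{{json .}}"], capture_output=True, text=True).stdout`, ideally from a second terminal while the training cell is executing.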

Morganh · October 18, 2024, 3:01am

You can run training in either of the two ways below.

1. TAO launcher
2. docker run. The log above shows that you already entered the 5.5.0-PyT container successfully.

iamliuxin2020 · October 18, 2024, 4:08am

Okay, I will try to run it directly inside the container.
Still, I wonder why the TAO launcher does not work.
Could it be related to the Python version?

The TAO Toolkit 5.5.0 page says the Python requirement is >3.10, while the OCDNet notebook says the Python version should be >3.6 and <=3.10.

At the very beginning I tried Python 3.10, but tao model train failed with the error "Breaks with requests 2.32.0: Not supported URL scheme http+docker", so I downgraded Python to 3.8, and then it worked. I am not sure whether the problem reported here is related to the Python version.
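
The error text quoted above names the `requests` package rather than Python itself, so comparing the installed package versions between the two environments may be more informative than comparing interpreter versions. The sketch below is an assumption-laden check: the thresholds (`requests` at 2.32.0 or later together with the `docker` Python package older than 7.1.0) are my reading of the upstream issue with that title and should be verified against the packages' release notes.

```python
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.32.0' into (2, 32, 0), ignoring any non-numeric suffix."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_suspect_pair(requests_version: str, docker_version: str) -> bool:
    """Assumed failing combination: requests >= 2.32.0 with docker < 7.1.0."""
    return (
        version_tuple(requests_version) >= (2, 32, 0)
        and version_tuple(docker_version) < (7, 1, 0)
    )


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


requests_version = installed_version("requests")
docker_version = installed_version("docker")
if requests_version and docker_version:
    print(requests_version, docker_version,
          is_suspect_pair(requests_version, docker_version))
else:
    print("requests or docker is not installed in this environment")
```

Running this once in the Python 3.10 environment and once in the Python 3.8 environment would show whether the two environments resolved different package versions.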

iamliuxin2020 · October 20, 2024, 1:35pm

I am trying to use the container directly to train my model. Is there any documentation on which parameters I should pass to each model or container?

Morganh · October 21, 2024, 2:02am

Yes, you can.
You can use docker run.
After entering the container, you can run training without the leading "tao".
For example:
$ ocdnet train xxx
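
When bypassing the launcher, the bind mounts it would normally create have to be passed to `docker run` by hand. The following is a minimal sketch that assembles such a command from a mount list; `build_docker_run` is a hypothetical helper, the resource flags are the ones recommended in the container banner earlier in the thread, and the training arguments in the example are assumptions modelled on the launcher command from the first post, not a documented invocation.

```python
import shlex
from typing import Dict, List

IMAGE = "nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt"


def build_docker_run(image: str, mounts: List[Dict[str, str]], command: List[str]) -> List[str]:
    """Build a `docker run` argument list that bind-mounts each source at its destination."""
    argv = [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",
        "--ipc=host",
        "--ulimit", "memlock=-1",
        "--ulimit", "stack=67108864",
    ]
    for mount in mounts:
        argv += ["-v", f"{mount['source']}:{mount['destination']}"]
    return argv + [image] + command


# Mounts taken from the post; the command arguments are assumptions.
mounts = [
    {"source": "/home/cc/Documents/tao-ocd-dir/data/ocdnet",
     "destination": "/data/ocdnet"},
    {"source": "/home/cc/Documents/tao_tutorials/notebooks/tao_launcher_starter_kit/ocdnet/specs",
     "destination": "/specs"},
    {"source": "/home/cc/Documents/tao-ocd-dir/ocdnet/results",
     "destination": "/results"},
]
command = ["ocdnet", "train", "-e", "/specs/train.yaml", "results_dir=/results/train"]
print(shlex.join(build_docker_run(IMAGE, mounts, command)))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; the printed line can be pasted into a terminal once the paths have been checked.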

Morganh · October 21, 2024, 2:06am

Another hint is from Tao toolkit observations - #63 by foreverneilyoung.

system · November 4, 2024, 2:06am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
