Tao model error

Hi there, I am a beginner with TAO Toolkit. I am following the instructions in the OCDNet notebook and have completed these steps: 1. set up the env variables and mapped the drives, 2. installed the TAO launcher, 3. set up the training spec (actually, it is used as provided).

Please provide the following information when requesting support.

• Hardware: RTX 3060ti
• Network Type : ocdnet_vtrainable_resnet18_v1.0
• TLT Version Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024

• Training spec file(If have, please share here)
model:
  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  pretrained_model_path: '/data/ocdnet/ocdnet_deformable_resnet18.pth'
  backbone: deformable_resnet18

train:
  results_dir: /results/train
  num_gpus: 1
  num_epochs: 30
  #resume_training_checkpoint_path: '/results/train/resume.pth'
  checkpoint_interval: 1
  validation_interval: 1
  trainer:
    clip_grad_norm: 5.0

  optimizer:
    type: Adam
    args:
      lr: 0.001

  lr_scheduler:
    type: WarmupPolyLR
    args:
      warmup_epoch: 3

  post_processing:
    type: SegDetectorRepresenter
    args:
      thresh: 0.3
      box_thresh: 0.55
      max_candidates: 1000
      unclip_ratio: 1.5

  metric:
    type: QuadMetric
    args:
      is_output_polygon: false

dataset:
  train_dataset:
    data_path: ['/data/ocdnet/train']
    args:
      pre_processes:
        - type: IaaAugment
          args:
            - {'type':Fliplr, 'args':{'p':0.5}}
            - {'type': Affine, 'args':{'rotate':[-10,10]}}
            - {'type':Resize,'args':{'size':[0.5,3]}}
        - type: EastRandomCropData
          args:
            size: [640,640]
            max_tries: 50
            keep_ratio: true
        - type: MakeBorderMap
          args:
            shrink_ratio: 0.4
            thresh_min: 0.3
            thresh_max: 0.7
        - type: MakeShrinkMap
          args:
            shrink_ratio: 0.4
            min_text_size: 8

      img_mode: BGR
      filter_keys: [img_path,img_name,text_polys,texts,ignore_tags,shape]
      ignore_tags: ['*', '###']
    loader:
      batch_size: 20
      pin_memory: true
      num_workers: 12

  validate_dataset:
    data_path: ['/data/ocdnet/test']
    args:
      pre_processes:
        - type: Resize2D
          args:
            short_size:
              - 1280
              - 736
            resize_text_polys: true
      img_mode: BGR
      filter_keys:
      ignore_tags: ['*', '###']
    loader:
      batch_size: 1
      pin_memory: false
      num_workers: 1

The mount.json is as follows:

{
    "Mounts": [
        {
            "source": "/home/cc/Documents/tao-ocd-dir",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/cc/Documents/tao-ocd-dir/data/ocdnet",
            "destination": "/data/ocdnet"
        },
        {
            "source": "/home/cc/Documents/tao_tutorials/notebooks/tao_launcher_starter_kit/ocdnet/specs",
            "destination": "/specs"
        },
        {
            "source": "/home/cc/Documents/tao-ocd-dir/ocdnet/results",
            "destination": "/results"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "network": "host"
    }
}
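
To rule out a mistyped path in the mounts file, the host-side "source" directories can be checked before retrying. This is only a minimal sketch: the helper name check_mount_sources is hypothetical, and the paths in the example call are copied from the mounts file in this post.

```shell
# Hypothetical helper: report which host-side "source" directories exist.
# Prints one line per path and returns non-zero if any path is missing.
check_mount_sources() {
  missing=0
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      echo "ok       $dir"
    else
      echo "missing  $dir"
      missing=1
    fi
  done
  return "$missing"
}

# Example call with the four sources from the mounts file (run on the host):
#   check_mount_sources \
#     /home/cc/Documents/tao-ocd-dir \
#     /home/cc/Documents/tao-ocd-dir/data/ocdnet \
#     /home/cc/Documents/tao_tutorials/notebooks/tao_launcher_starter_kit/ocdnet/specs \
#     /home/cc/Documents/tao-ocd-dir/ocdnet/results
```

A "missing" line would point to a typo in the mounts file or a directory that was never created. If every path reports "ok", the mounts are at least present on the host.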

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

The first time I ran the cell below, TAO started pulling a lot of container layers, but then my notebook crashed, so I am not sure whether it pulled everything it needed.

!tao model ocdnet train \
    -e $SPECS_DIR/train.yaml \
    results_dir=$RESULTS_DIR \
    model.pretrained_model_path=$RESULTS_DIR/pretrained_ocdnet/ocdnet_vtrainable_resnet18_v1.0/ocdnet_deformable_resnet18.pth

But when I reopened the notebook and ran this command again, it no longer downloaded any containers and only printed the messages and error below:

2024-10-17 00:33:21,551 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-17 00:33:21,593 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-10-17 00:33:21,601 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
Error response from daemon: No such container: bddc7ffc435e4e9cea1521c61dd1aeb52c8151e0795e5d7c8c3c65d02460b150
2024-10-17 00:33:22,387 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

What should I do?

You can open a terminal to check whether the Docker image was pulled successfully.
Run:
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash
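
The local image store can also be queried without starting a container. A minimal sketch, assuming the image tag from the launcher log above: the function name tao_image_present is hypothetical, while docker image inspect is a standard Docker CLI subcommand that exits non-zero when the image is not available locally.

```shell
# Hypothetical helper: succeed only if the given image exists locally.
tao_image_present() {
  docker image inspect "$1" >/dev/null 2>&1
}

# Image tag taken from the launcher log in this thread.
TAO_IMAGE="nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt"

if tao_image_present "$TAO_IMAGE"; then
  echo "image present: $TAO_IMAGE"
else
  echo "image missing, try: docker pull $TAO_IMAGE"
fi
```

If the image is reported missing, running docker pull again should fetch it.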

Then I got this:

===========================
=== TAO Toolkit PyTorch ===

NVIDIA Release 5.5.0-PyT (build 88113656)
TAO Toolkit Version 5.5.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TAO Toolkit. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 …

root@29998351c88a:/opt/nvidia/tools#

I am guessing this might be because I did not install nvidia-docker2? I remember that a prompt about it appeared while I was installing the TAO launcher, but I ignored it.

Sorry, I just tried that. I installed nvidia-docker2, restarted the notebook, and ran the tao train command again. It still failed.

2024-10-17 19:19:55,629 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-17 19:19:55,676 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-10-17 19:19:55,685 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
Error response from daemon: No such container: 64e3e92d0c70c6146eededa5458f6aa0bf5e6466134407c01d114e36c4ef5a49
2024-10-17 19:19:56,395 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Then I ran some checks on my Docker setup:

(launcher) cc@CC-desktop-7921:~$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
(launcher) cc@CC-desktop-7921:~$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/nvidia/tao/tao-toolkit 5.5.0-pyt 98766d6ac7d2 8 weeks ago 25.9GB
hello-world latest d2c94e258dcb 17 months ago 13.3kB
(launcher) cc@CC-desktop-7921:~$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:

  1. The Docker client contacted the Docker daemon.
  2. The Docker daemon pulled the “hello-world” image from the Docker Hub.
    (amd64)
  3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
  4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/get-started/

(launcher) cc@CC-desktop-7921:~$

It seems that Docker itself is working correctly, and so are the rootless permissions.
The image was pulled, so I do not know why I cannot train the model.

Here is the output of the TAO launcher's list command:

(launcher) cc@CC-desktop-7921:~$ tao list
============== ================== =========
container_id container_status command
============== ================== =========
============== ================== =========

It seems that there are some containers, but their IDs are not listed.

You can run it in either of the following ways.

  1. TAO launcher
  2. docker run. The log above shows that you already started the 5.5.0-PyT container successfully.

Okay, I will try to run it directly inside the container.
I still wonder why the TAO launcher does not work.
Could it be related to the Python version?

For TAO Toolkit 5.5.0, the stated Python requirement is >3.10, while the OCDNet notebook says the Python version should be >3.6 and <=3.10.

At the very beginning, I tried Python 3.10, but tao model train failed with the error "Breaks with requests 2.32.0: Not supported URL scheme http+docker", so I downgraded to Python 3.8 and that error went away. I am not sure whether the problem reported here is related to the Python version.
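
The quoted error text matches a known incompatibility between older releases of the Docker SDK for Python (the docker package) and requests 2.32.0. Comparing the installed versions in the launcher environment may therefore tell you more than changing the Python version. A minimal sketch, assuming the launcher is installed in the currently active environment; the helper name show_docker_client_versions is hypothetical.

```shell
# Hypothetical helper: print the installed versions of the two packages
# involved in the "http+docker" error, using the given Python interpreter
# (defaults to python3).
show_docker_client_versions() {
  "${1:-python3}" -m pip show requests docker 2>/dev/null \
    | grep -E '^(Name|Version):'
}
```

Calling show_docker_client_versions inside the (launcher) environment lists both versions. If requests is 2.32.0 or newer alongside an older docker package, pinning requests below 2.32 is a commonly reported workaround. Whether that is compatible with the launcher's own dependency pins is an assumption to verify.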

I am trying to use the container directly to train my model. Is there any documentation on which parameters I should pass to each model/container?

Yes, you can.
You can use docker run.
After logging in to the container, you can run training without “tao” at the beginning of the command.
For example,
$ ocdnet train xxx
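
Putting the pieces of this thread together, a direct docker run might look like the sketch below. It is an example built on assumptions, not an official recipe. The -v bind mounts mirror the mounts file posted earlier, and the --gpus all --ipc=host --ulimit flags are the ones the container banner recommends. The spec path /specs/train.yaml assumes the notebook's train.yaml sits in the mounted specs directory. The function name run_ocdnet_train and the HOST_ROOT and SPECS_SRC variables are hypothetical.

```shell
# Host directories, copied from the mounts file in this thread.
HOST_ROOT="/home/cc/Documents/tao-ocd-dir"
SPECS_SRC="/home/cc/Documents/tao_tutorials/notebooks/tao_launcher_starter_kit/ocdnet/specs"
TAO_IMAGE="nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt"

# Hypothetical wrapper: start the container with the same mounts the
# launcher would use and run training without the "tao" prefix.
# Extra arguments are appended to the ocdnet train command line.
run_ocdnet_train() {
  docker run --rm --gpus all --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v "$HOST_ROOT:/workspace/tao-experiments" \
    -v "$HOST_ROOT/data/ocdnet:/data/ocdnet" \
    -v "$SPECS_SRC:/specs" \
    -v "$HOST_ROOT/ocdnet/results:/results" \
    "$TAO_IMAGE" \
    ocdnet train -e /specs/train.yaml "$@"
}
```

Called as run_ocdnet_train results_dir=/results/train, any extra arguments are passed through as spec overrides, the same way the notebook cell appends them. The "user" and "network" entries under DockerOptions correspond to the --user and --network flags of docker run and could be added if needed.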


Another hint is from Tao toolkit observations - #63 by foreverneilyoung.
