TAO Toolkit version 5 gives an error when it comes to the training part

I am using the DINO model to train on a custom dataset. Everything ran well until the training section, where I get the following error. I'm still having problems, even though I've tried a few solutions from the Docker community.

For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
2023-07-31 15:18:13,986 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-07-31 15:18:14,066 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
2023-07-31 15:18:14,107 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("unable to find user $(id -u): no matching entries in passwd file")
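
(A side note on this particular message: the literal, unexpanded $(id -u) suggests the Docker user option reached the launcher as the raw string "$(id -u)" instead of a numeric UID:GID, for example via a hand-written ~/.tao_mounts.json. Below is a minimal sketch of regenerating that file with the IDs actually substituted; the mount path is just an example taken from later in this thread.)

# Unquoted EOF lets the shell expand $(id -u) and $(id -g) before the file is written,
# so the "user" field ends up with real numeric IDs instead of the literal string.
# The mount path below is an example; adjust it to your own project directory.
cat > ~/.tao_mounts.json <<EOF
{
    "Mounts": [
        {
            "source": "/home/usr/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "user": "$(id -u):$(id -g)"
    }
}
EOF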

Can you pull this docker successfully?
$ docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt

Yes, I can pull the image successfully without any errors.

For the error, could you please share the full command and full log?

I arranged all the commands in a shell script, and I am using the nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 container for training. The command I used for training is: tao model dino train -e $SPECS_DIR/train.yaml results_dir=$RESULTS_DIR/

Below is the log:

drwxrwxr-x 2 1000 1000 4096 Jul 28 05:47 annotations
drwxrwxr-x 2 1000 1000 4096 Jul 28 05:35 train2017
drwxrwxr-x 2 1000 1000 4096 Jul 28 05:36 val2017

Set the environment variables

export CLI="ngccli_cat_linux.zip"
export LOCAL_PROJECT_DIR="/root/tao-experiments"

Create the ngccli directory

mkdir -p "/workspace/tao-experiments/ngccli"

Remove any previously existing CLI installations

rm -rf "/workspace/tao-experiments/ngccli/*"

Download the NGC CLI

wget "https://ngc.nvidia.com/downloads/" -P "/workspace/tao-experiments/ngccli"

Unzip the downloaded file

unzip -u "/workspace/tao-experiments/ngccli/" -d "/workspace/tao-experiments/ngccli/"

Remove the downloaded zip file

rm "/workspace/tao-experiments/ngccli/"

Add ngc-cli to the PATH environment variable

export PATH="/workspace/tao-experiments/ngccli/ngc-cli:/opt/nvidia/tools:/opt/openmpi/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin"

Run the NGC CLI command to list the models

ngc registry model list nvidia/tao/pretrained_dino_nvimagenet:*

Pull pretrained model from NGC

ngc registry model download-version nvidia/tao/pretrained_dino_nvimagenet:resnet50 --dest "/workspace/tao-experiments/dino/"
Check that the model is downloaded into the directory.
total 299924
-rw------- 1 root root 307117121 Jul 28 08:36 resnet50_nvimagenetv2.pth.tar
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 5
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/data/raw-data/train2017/
      json_file: /workspace/tao-experiments/data/raw-data/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/data/raw-data/val2017/
      json_file: /workspace/tao-experiments/data/raw-data/annotations/instances_val2017.json
  num_classes: 11
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: resnet50
  train_backbone: True
  pretrained_backbone_path: /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vresnet50/resnet50_nvimagenetv2.pth.tar
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
/workspace/tao-experiments/data
For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
2023-07-31 07:59:29,753 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-07-31 07:59:29,839 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
2023-07-31 07:59:29,889 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("unable to find user $(id -u): no matching entries in passwd file")

Can you share the script? Or could you use the command below to check whether the training works?
Refer to Working With the Containers - NVIDIA Docs and DINO - NVIDIA Docs

docker run -it --rm --gpus all \
  -v /path/in/host:/path/in/docker \
  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt \
  dino train -e /path/to/train.yaml \
  -r /path/to/results/dir \
  --gpus 2

Dino_training.sh (3.5 KB)
I have tried the command that you sent and it gives this error: ./Dino_training.sh: line 128: dino: command not found
I am attaching my script for your reference; please kindly check.

Please use

tao model dino train -e $SPECS_DIR/train.yaml -r $RESULTS_DIR -k $KEY --gpus 2

Refer to the DINO notebook:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/tao-getting-started/version/5.0.0/files/notebooks/tao_launcher_starter_kit/dino/dino.ipynb

This is the training command from TAO Toolkit version 5; I only followed the Jupyter notebook:
!tao model dino train \
  -e $SPECS_DIR/train.yaml \
  results_dir=$RESULTS_DIR/

I didn't see the -k $KEY --gpus 2 part in the command, but version 3 had it.

Correct. In the latest TAO 5.0, "-k $KEY" is not needed.

It is not working without the key either. I tried all three of the following Docker images, and all of them give the same error that I mentioned above:
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

DINO only works on nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt.
Could you start this Docker container and run again?

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash

Then,
dino train -e your_train.yaml -r result
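
As a rough sketch of a one-shot run (the host path and the spec/results locations below are assumptions, based on the mounts used elsewhere in this thread), the spec file and dataset also have to be mounted so the entry point can find them:

# Mount the experiment directory and call the dino entry point directly,
# instead of opening an interactive shell first.
docker run --runtime=nvidia -it --rm \
  -v /home/usr/tao-experiments:/workspace/tao-experiments \
  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt \
  dino train -e /workspace/tao-experiments/specs/train.yaml \
             -r /workspace/tao-experiments/results/dino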

OK, I will try it and update. One more question: during the training process, it will automatically pull Docker images from NGC, right?

I tried this and the result is the same. Even when I run on the host, it is the same problem.

Could you upload the full command and full log? Thanks.

Also, are you running on a Jetson device?

No, I am not running on a Jetson device. This is my docker run command:
sudo docker run -it --runtime=nvidia -it -e DISPLAY=$DISPLAY -v /home/usr/tao-training/:/workspace/tao-training -v /home/usr/tao-experiments/:/workspace/tao-experiments -v /tmp/.X11-unix/:/tmp/.X11-unix -v /dev:/dev -v /var/run/docker.sock:/var/run/docker.sock -v /usr/bin/docker:/usr/bin/docker nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash

I also uploaded the entire log.
dinotraining (299.0 KB)

To narrow it down, instead of running the shell script file, can you run again with the command below directly?

sudo docker run -it --runtime=nvidia -it -e DISPLAY=$DISPLAY -v /home/usr/tao-training/:/workspace/tao-training -v /home/usr/tao-experiments/:/workspace/tao-experiments -v /tmp/.X11-unix/:/tmp/.X11-unix -v /dev:/dev -v /var/run/docker.sock:/var/run/docker.sock -v /usr/bin/docker:/usr/bin/docker nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash

This command runs successfully and creates the container; the error occurs inside the container when I try to start training.

After entering the Docker container, can you run the command below?
root@35ec36a31249:/opt/nvidia/tools# dino train

On my side, it will be

root@35ec36a31249:/opt/nvidia/tools# dino train
INFO: generated new fontManager
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
ERROR: The subtask train requires the following argument: -e/--experiment_spec_file
root@35ec36a31249:/opt/nvidia/tools#

Thank you so much Morganh, it is working now. However, another problem is occurring. The following is the log; please kindly check.

For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
sys:1: UserWarning:
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
:107: UserWarning:
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Train results will be saved at: /results/train
Loaded pretrained weights from /opt/nvidia/tools/tao-experiments/dino/pretrained_dino_nvimagenet_vresnet50/resnet50_nvimagenetv2.pth.tar

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /results/train/lightning_logs
Serializing 0 elements to byte tensors and concatenating them all …
need at least one array to concatenate
Error executing job with overrides: ['results_dir=/results']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 254, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py>", line 3, in
  File "", line 209, in
  File "", line 107, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "", line 205, in main
  File "", line 194, in main
  File "", line 172, in run_experiment
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1037, in _run
    self._call_setup_hook()  # allow user to setup lightning_module in accelerator environment
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1284, in _call_setup_hook
    self._call_lightning_datamodule_hook("setup", stage=fn)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1361, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/dataloader/od_data_module.py", line 64, in setup
    self.train_dataset = build_shm_dataset(train_data_sources, train_transform)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/dataloader/serialized_dataset.py", line 117, in build_shm_dataset
    dataset = SerializedDatasetFromList(dataset_list, transforms=transforms)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/dataloader/serialized_dataset.py", line 146, in __init__
    self._lst = np.concatenate(self._lst)
  File "<__array_function__ internals>", line 180, in concatenate
ValueError: need at least one array to concatenate
Execution status: FAIL
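
The "need at least one array to concatenate" failure, together with "Serializing 0 elements" earlier in the log, means the dataset list built from train_data_sources came out empty, i.e. no usable images/annotations were found at the configured paths. A quick, hedged sanity check from inside the container (paths copied from the train.yaml shown earlier; note that this run loaded the backbone from /opt/nvidia/tools/tao-experiments/..., so the /workspace/tao-experiments/... dataset paths may not be what this container actually sees):

# Confirm the image directory and COCO annotation file are mounted and non-empty.
ls /workspace/tao-experiments/data/raw-data/train2017 | head
python3 -c "import json; d = json.load(open('/workspace/tao-experiments/data/raw-data/annotations/instances_train2017.json')); print(len(d['images']), 'images,', len(d['annotations']), 'annotations')"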