TAO Toolkit version 5 gives an error when it comes to the training part

I am using the DINO model to train on a custom dataset. Everything ran well until the training section, where I get the following error. I'm still having problems, even though I've tried a few solutions from the Docker community.

For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
2023-07-31 15:18:13,986 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-07-31 15:18:14,066 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
2023-07-31 15:18:14,107 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("unable to find user $(id -u): no matching entries in passwd file")
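
(A side note on this particular message: the literal, unexpanded $(id -u) suggests the Docker user option reached the launcher as the raw string "$(id -u)" instead of a numeric UID:GID, for example via a hand-written ~/.tao_mounts.json. Below is a minimal sketch of regenerating that file with the IDs actually substituted; the mount path is just an example taken from later in this thread.)

# Unquoted EOF lets the shell expand $(id -u) and $(id -g) before the file is written,
# so the "user" field ends up with real numeric IDs instead of the literal string.
# The mount path below is an example; adjust it to your own project directory.
cat > ~/.tao_mounts.json <<EOF
{
    "Mounts": [
        {
            "source": "/home/usr/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "user": "$(id -u):$(id -g)"
    }
}
EOF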

Can you pull this docker successfully?
$ docker pull nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt

Yes, I can pull the image successfully without any errors.

For the error, could you please share the full command and full log?

I arranged all the commands in a shell script, and I am using the nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 container for training. The command I used for training is: tao model dino train -e $SPECS_DIR/train.yaml results_dir=$RESULTS_DIR/

Below is the log:

drwxrwxr-x 2 1000 1000 4096 Jul 28 05:47 annotations
drwxrwxr-x 2 1000 1000 4096 Jul 28 05:35 train2017
drwxrwxr-x 2 1000 1000 4096 Jul 28 05:36 val2017

Set the environment variables

export CLI="ngccli_cat_linux.zip"
export LOCAL_PROJECT_DIR="/root/tao-experiments"

Create the ngccli directory

mkdir -p "/workspace/tao-experiments/ngccli"

Remove any previously existing CLI installations

rm -rf "/workspace/tao-experiments/ngccli/*"

Download the NGC CLI

wget "https://ngc.nvidia.com/downloads/" -P "/workspace/tao-experiments/ngccli"

Unzip the downloaded file

unzip -u "/workspace/tao-experiments/ngccli/" -d "/workspace/tao-experiments/ngccli/"

Remove the downloaded zip file

rm "/workspace/tao-experiments/ngccli/"

Add ngc-cli to the PATH environment variable

export PATH="/workspace/tao-experiments/ngccli/ngc-cli:/opt/nvidia/tools:/opt/openmpi/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin"

Run the NGC CLI command to list the models

ngc registry model list nvidia/tao/pretrained_dino_nvimagenet:*

Pull pretrained model from NGC

ngc registry model download-version nvidia/tao/pretrained_dino_nvimagenet:resnet50 --dest "/workspace/tao-experiments/dino/"
Check that the model is downloaded into the directory.
total 299924
-rw------- 1 root root 307117121 Jul 28 08:36 resnet50_nvimagenetv2.pth.tar
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 5
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/data/raw-data/train2017/
      json_file: /workspace/tao-experiments/data/raw-data/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/data/raw-data/val2017/
      json_file: /workspace/tao-experiments/data/raw-data/annotations/instances_val2017.json
  num_classes: 11
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: resnet50
  train_backbone: True
  pretrained_backbone_path: /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vresnet50/resnet50_nvimagenetv2.pth.tar
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
/workspace/tao-experiments/data
For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
2023-07-31 07:59:29,753 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-07-31 07:59:29,839 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
2023-07-31 07:59:29,889 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
Docker instantiation failed with error: 500 Server Error: Internal Server Error ("unable to find user $(id -u): no matching entries in passwd file")

Can you share the script? Or could you use the command below to check whether the training works?
Refer to Working With the Containers - NVIDIA Docs and DINO - NVIDIA Docs

docker run -it --rm --gpus all \
  -v /path/in/host:/path/in/docker \
  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt \
  dino train -e /path/to/train.yaml \
  -r /path/to/results/dir \
  --gpus 2

Dino_training.sh (3.5 KB)
I have tried the command that you sent and it gives this error: ./Dino_training.sh: line 128: dino: command not found
I am attaching my script for your reference; please kindly check.

Please use

tao model dino train -e $SPECS_DIR/train.yaml -r $RESULTS_DIR -k $KEY --gpus 2

Refer to the DINO notebook:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/tao-getting-started/version/5.0.0/files/notebooks/tao_launcher_starter_kit/dino/dino.ipynb

This is the training command from TAO Toolkit version 5; I only followed the Jupyter notebook:
!tao model dino train \
  -e $SPECS_DIR/train.yaml \
  results_dir=$RESULTS_DIR/

I didn't see the -k $KEY --gpus 2 part in the command, but version 3 had it.

Correct. In the latest TAO 5.0, "-k $KEY" is not needed.

It is not working without the key either. I tried all three of the following Docker images, and all of them give the same error that I mentioned above:
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

DINO only works on nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt.
Could you start this Docker container and run again?

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash

Then,
dino train -e your_train.yaml -r result
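
As a rough sketch of a one-shot run (the host path and the spec/results locations below are assumptions, based on the mounts used elsewhere in this thread), the spec file and dataset also have to be mounted so the entry point can find them:

# Mount the experiment directory and call the dino entry point directly,
# instead of opening an interactive shell first.
docker run --runtime=nvidia -it --rm \
  -v /home/usr/tao-experiments:/workspace/tao-experiments \
  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt \
  dino train -e /workspace/tao-experiments/specs/train.yaml \
             -r /workspace/tao-experiments/results/dino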

OK, I will try it and update. One more question: during the training process, it will automatically pull Docker images from NGC, right?

I tried this and the result is the same. Even when I run on the host, it is the same problem.

Could you upload the full command and full log? Thanks.

Also, are you running on a Jetson device?

No, I am not running on a Jetson device. This is my docker run command:
sudo docker run -it --runtime=nvidia -it -e DISPLAY=$DISPLAY -v /home/usr/tao-training/:/workspace/tao-training -v /home/usr/tao-experiments/:/workspace/tao-experiments -v /tmp/.X11-unix/:/tmp/.X11-unix -v /dev:/dev -v /var/run/docker.sock:/var/run/docker.sock -v /usr/bin/docker:/usr/bin/docker nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash

I also uploaded the entire log.
dinotraining (299.0 KB)

To narrow it down, instead of running the shell script file, can you run again with the command below directly?

sudo docker run -it --runtime=nvidia -it -e DISPLAY=$DISPLAY -v /home/usr/tao-training/:/workspace/tao-training -v /home/usr/tao-experiments/:/workspace/tao-experiments -v /tmp/.X11-unix/:/tmp/.X11-unix -v /dev:/dev -v /var/run/docker.sock:/var/run/docker.sock -v /usr/bin/docker:/usr/bin/docker nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash

This command runs successfully and creates the container; the error occurs inside the container when I try to start training.

After entering the Docker container, can you run the command below?
root@35ec36a31249:/opt/nvidia/tools# dino train

On my side, it will be

root@35ec36a31249:/opt/nvidia/tools# dino train
INFO: generated new fontManager
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
ERROR: The subtask train requires the following argument: -e/--experiment_spec_file
root@35ec36a31249:/opt/nvidia/tools#

Thank you so much Morganh, it is working now. However, another problem is occurring. The following is the log; please kindly check.

For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
sys:1: UserWarning:
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
:107: UserWarning:
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Train results will be saved at: /results/train
Loaded pretrained weights from /opt/nvidia/tools/tao-experiments/dino/pretrained_dino_nvimagenet_vresnet50/resnet50_nvimagenetv2.pth.tar

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /results/train/lightning_logs
Serializing 0 elements to byte tensors and concatenating them all …
need at least one array to concatenate
Error executing job with overrides: ['results_dir=/results']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 254, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py>", line 3, in
  File "", line 209, in
  File "", line 107, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "", line 205, in main
  File "", line 194, in main
  File "", line 172, in run_experiment
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1037, in _run
    self._call_setup_hook()  # allow user to setup lightning_module in accelerator environment
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1284, in _call_setup_hook
    self._call_lightning_datamodule_hook("setup", stage=fn)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1361, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/dataloader/od_data_module.py", line 64, in setup
    self.train_dataset = build_shm_dataset(train_data_sources, train_transform)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/dataloader/serialized_dataset.py", line 117, in build_shm_dataset
    dataset = SerializedDatasetFromList(dataset_list, transforms=transforms)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/dataloader/serialized_dataset.py", line 146, in __init__
    self._lst = np.concatenate(self._lst)
  File "<__array_function__ internals>", line 180, in concatenate
ValueError: need at least one array to concatenate
Execution status: FAIL
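
The "need at least one array to concatenate" failure, together with "Serializing 0 elements" earlier in the log, means the dataset list built from train_data_sources came out empty, i.e. no usable images/annotations were found at the configured paths. A quick, hedged sanity check from inside the container (paths copied from the train.yaml shown earlier; note that this run loaded the backbone from /opt/nvidia/tools/tao-experiments/..., so the /workspace/tao-experiments/... dataset paths may not be what this container actually sees):

# Confirm the image directory and COCO annotation file are mounted and non-empty.
ls /workspace/tao-experiments/data/raw-data/train2017 | head
python3 -c "import json; d = json.load(open('/workspace/tao-experiments/data/raw-data/annotations/instances_train2017.json')); print(len(d['images']), 'images,', len(d['annotations']), 'annotations')"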