Error in classification_pyt train

Please provide the following information when requesting support.
• Hardware: GeForce RTX 4070
• Network Type: Classification (classification_pyt, pretrained_fan_hybrid_small)
• TAO Version: 5.2

I’m trying to run the example classification_pyt training notebook without any changes. When I run the train command, I get the following error:

OCI runtime exec failed: exec failed: unable to start container process: exec: “classification_pyt”: executable file not found in $PATH: unknown
2024-01-02 15:42:42,245 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

How can I solve this error and train the model using this notebook?

Thank you in advance for the help!

Can you share the full command and the full logs? Screenshots may also be helpful.

Thank you for your fast response! Here’s a screenshot with the full command and the full log.

Can you run the command below and share the result?
! tao info --verbose

The result is:

Configuration of the TAO Toolkit Instance

task_group:
    model:
        dockers:
            nvidia/tao/tao-toolkit:
                5.0.0-tf2.11.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. classification_tf2
                        2. efficientdet_tf2
                5.0.0-tf1.15.5:
                    docker_registry: nvcr.io
                    tasks:
                        1. bpnet
                        2. classification_tf1
                        3. converter
                        4. detectnet_v2
                        5. dssd
                        6. efficientdet_tf1
                        7. faster_rcnn
                        8. fpenet
                        9. lprnet
                        10. mask_rcnn
                        11. multitask_classification
                        12. retinanet
                        13. ssd
                        14. unet
                        15. yolo_v3
                        16. yolo_v4
                        17. yolo_v4_tiny
                5.2.0-pyt2.1.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. action_recognition
                        2. centerpose
                        3. deformable_detr
                        4. dino
                        5. mal
                        6. ml_recog
                        7. ocdnet
                        8. ocrnet
                        9. optical_inspection
                        10. pointpillars
                        11. pose_classification
                        12. re_identification
                        13. visual_changenet
                5.2.0-pyt1.14.0:
                    docker_registry: nvcr.io
                    tasks:
                        1. classification_pyt
                        2. segformer
    dataset:
        dockers:
            nvidia/tao/tao-toolkit:
                5.2.0-data-services:
                    docker_registry: nvcr.io
                    tasks:
                        1. augmentation
                        2. auto_label
                        3. annotations
                        4. analytics
    deploy:
        dockers:
            nvidia/tao/tao-toolkit:
                5.2.0-deploy:
                    docker_registry: nvcr.io
                    tasks:
                        1. visual_changenet
                        2. centerpose
                        3. classification_pyt
                        4. classification_tf1
                        5. classification_tf2
                        6. deformable_detr
                        7. detectnet_v2
                        8. dino
                        9. dssd
                        10. efficientdet_tf1
                        11. efficientdet_tf2
                        12. faster_rcnn
                        13. lprnet
                        14. mask_rcnn
                        15. ml_recog
                        16. multitask_classification
                        17. ocdnet
                        18. ocrnet
                        19. optical_inspection
                        20. retinanet
                        21. segformer
                        22. ssd
                        23. trtexec
                        24. unet
                        25. yolo_v3
                        26. yolo_v4
                        27. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.2.0
published_date: 12/06/2023

Can you open a new terminal instead, start the container directly, and trigger the notebook again?
$ docker run --runtime=nvidia -it --rm -p 8888:8888 nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt1.14.0 /bin/bash

Then, check the classification_pyt command:
root@8d6c08489e41:/workspace# classification_pyt train

Or trigger the notebook:
root@8d6c08489e41:/workspace# jupyter notebook --ip 0.0.0.0 --allow-root
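
If the notebook’s data and spec files live on the host, a volume mount can also be added so they are visible inside the container. A minimal sketch (the host path below is a placeholder for your own setup):

$ docker run --runtime=nvidia -it --rm -p 8888:8888 \
    -v /home/user/tao-experiments:/workspace/tao-experiments \
    nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt1.14.0 /bin/bash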

Running $ docker run --runtime=nvidia -it --rm -p 8888:8888 nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt1.14.0 /bin/bash leads to the following error:

chmod: cannot access ‘/opt/ngccli/ngc’: No such file or directory

Please use the new one below.
$ docker run --runtime=nvidia -it --rm -p 8888:8888 nvcr.io/nvidia/tao/tao-toolkit:5.2.0.1-pyt1.14.0 /bin/bash

Using this new one, and then running classification_pyt train with the spec file from the notebook, I get the following error:

Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 132, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 118, in main
    results_dir = os.path.join(cfg.results_dir, "train")
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: results_dir
    full_key: results_dir
    object_type=ExperimentConfig

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 373) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-04_08:20:25
  host      : edba0f4d209a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 373)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Execution status: FAIL

Please double check and set the -r argument. You can set results_dir explicitly.
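
For example, something like the following, where -e points to the experiment spec and -r to the results directory (both paths below are placeholders for your own setup):

root@8d6c08489e41:/workspace# classification_pyt train \
    -e /workspace/tao-experiments/specs/train_spec.yaml \
    -r /workspace/tao-experiments/results

Alternatively, since the error shows results_dir is a top-level field of ExperimentConfig, it can be set directly in the spec file instead of via -r.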

With the -r argument it runs, but at the end of training it prints a warning:

Telemetry data couldn’t be sent, but the command ran successfully.
[WARNING]: module ‘urllib3.exceptions’ has no attribute ‘SubjectAltNameWarning’
Execution status: PASS

It saves a .pth checkpoint for each epoch and some JSON files, but no .tlt files or equivalent. Is that supposed to be the case?

Yes, that is expected as of TAO 5.0. In previous TAO PyTorch dockers, the .tlt file was actually just an encrypted PyTorch model.
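
If you want to confirm the checkpoint is a plain PyTorch file, it can be inspected inside the container. A minimal sketch (the checkpoint path below is a placeholder for one of your saved epochs):

root@8d6c08489e41:/workspace# python -c "import torch; print(list(torch.load('/results/train/epoch_0.pth', map_location='cpu').keys()))"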

Perfect, then it’s working. Thank you for all the help!
