Bug: TAO Toolkit can't run interactively

The “TAO Toolkit for Computer Vision” crashes on startup due to a bad filepath.

Is this bug report correct, and is it being tracked?

Problem

Both of these images produce the same result:
nvcr.io/nvidia/tao/tao-toolkit-tf v3.22.05-tf1.15.5-py3 b85103564252
nvcr.io/nvidia/tao/tao-toolkit-tf v3.22.05-tf1.15.4-py3 ca92a571a959

$ docker run --network=host -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
--2022-11-16 15:59:43--  https://ngc.nvidia.com/downloads/ngccli_reg_linux.zip
Resolving ngc.nvidia.com (ngc.nvidia.com)... 99.84.208.14, 99.84.208.8, 99.84.208.59, ...
Connecting to ngc.nvidia.com (ngc.nvidia.com)|99.84.208.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39221496 (37M) [application/zip]
Saving to: ‘/opt/ngccli/ngccli_reg_linux.zip’

ngccli_reg_linux.zip                          100%[============>]  37.40M  15.7MB/s    in 2.4s

2022-11-16 15:59:46 (15.7 MB/s) - ‘/opt/ngccli/ngccli_reg_linux.zip’ saved [39221496/39221496]

Archive:  /opt/ngccli/ngccli_reg_linux.zip
...
chmod: cannot access '/opt/ngccli/ngc': No such file or directory

$ echo $?
1

Cause

This is due to line 23 of /install_ngc_cli.sh,

chmod u+x /opt/ngccli/ngc

because ngc is actually extracted to /opt/ngccli/ngc-cli/ngc, and not /opt/ngccli/ngc.

I confirmed this by overriding the entrypoint (normally /install_ngc_cli.sh) with /bin/bash, then running the script manually and inspecting the filesystem:

$ docker run --network=host -it --rm --entrypoint "/bin/bash" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
root@$HOST:/workspace# /install_ngc_cli.sh
...
chmod: cannot access '/opt/ngccli/ngc': No such file or directory

root@$HOST:/workspace# find / -type f -name ngc
/opt/ngccli/ngc-cli/ngc
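
The same lookup can be used to avoid hard-coding the path at all. The following is only a minimal sketch of that idea, assuming the archive unpacks into some subdirectory of the install root as the find output above shows. The function names locate_ngc and make_ngc_executable are hypothetical and are not part of /install_ngc_cli.sh.

```shell
#!/bin/sh
# locate_ngc ROOT: print the path of the first regular file named "ngc" under ROOT.
# Hypothetical helper for illustration; not part of the image.
locate_ngc() {
    find "$1" -type f -name ngc 2>/dev/null | head -n 1
}

# make_ngc_executable ROOT: chmod whichever ngc binary was actually extracted,
# instead of assuming a fixed path such as ROOT/ngc.
make_ngc_executable() {
    ngc_path=$(locate_ngc "$1")
    if [ -z "$ngc_path" ]; then
        echo "ngc not found under $1" >&2
        return 1
    fi
    chmod u+x "$ngc_path" || return 1
    printf '%s\n' "$ngc_path"
}
```

Inside the container this would be called as make_ngc_executable /opt/ngccli, which prints the path it changed or returns non-zero if no ngc file exists.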

Fix

The image can be patched to prevent the crash by creating a symlink at the path the script expects before running it:

# file:  /tmp/tao_fix/dockerfile
FROM nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
RUN mkdir /opt/ngccli && ln -s -T /opt/ngccli/ngc-cli/ngc /opt/ngccli/ngc && /install_ngc_cli.sh
ENTRYPOINT [ "/bin/bash" ]

$ docker build --ssh default --network=host -t tao_fix -f /tmp/tao_fix/dockerfile /tmp/tao_fix
Sending build context to Docker daemon  14.85kB
Step 1/3 : FROM nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
 ---> b85103564252
Step 2/3 : RUN mkdir /opt/ngccli && ln -s -T /opt/ngccli/ngc-cli/ngc /opt/ngccli/ngc && /install_ngc_cli.sh
 ---> Using cache
 ---> a435a881da46
Step 3/3 : ENTRYPOINT [ "/bin/bash" ]
 ---> Running in bbff6b43ec6f
Removing intermediate container bbff6b43ec6f
 ---> 6755508ae805
Successfully built 6755508ae805
Successfully tagged tao_fix:latest

$ docker run --network=host -it --rm tao_fix

root@$HOST:/workspace# /opt/ngccli/ngc -h
usage: ngc [--debug] [--format_type <fmt>] [--version] [-h] {config,diag,pym,registry,version} ...

NVIDIA NGC CLI

optional arguments:
  -h, --help            Show this help message and exit.
  --debug               Enable debug mode.
  --format_type <fmt>   Specify the output format type. Supported formats are: ascii, csv, json. Only commands that produce tabular data support csv format. Default: ascii
  --version             Show the CLI version and exit.

ngc:
  {config,diag,pym,registry,version}
    config              Configuration Commands
    diag                Diagnostic Commands
    pym                 PyM Commands
    registry            Registry Commands
    version             Version Commands
root@$HOST:/workspace# exit
$ echo $?
0
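
The workaround depends on ordering: the symlink is created while its target does not exist yet, and it only resolves once the install script has unpacked the archive, at which point chmod on /opt/ngccli/ngc follows the link. A minimal sketch of that ordering, using a throwaway directory in place of /opt/ngccli, is below. The function name link_ngc is hypothetical, and mkdir -p and ln -sfn are used here (instead of the mkdir and ln -s -T in the dockerfile above) so the step can be re-run safely.

```shell
#!/bin/sh
# link_ngc ROOT: create ROOT/ngc as a symlink to ROOT/ngc-cli/ngc.
# Hypothetical helper for illustration. The link may dangle at creation time;
# it resolves once the archive has been unpacked into ROOT/ngc-cli/.
link_ngc() {
    mkdir -p "$1" || return 1
    ln -sfn "$1/ngc-cli/ngc" "$1/ngc"
}
```

After link_ngc /opt/ngccli has run, a later chmod u+x /opt/ngccli/ngc applies to /opt/ngccli/ngc-cli/ngc as soon as that file exists.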

Requested Info

• Hardware: (RTX 3070 & A6000)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): N/A
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): N/A
• Training spec file(If have, please share here): N/A
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.): see above

If you are using tao-launcher, please install the latest nvidia-tao package:
$ pip3 install nvidia-tao

If you are using docker run, please add --entrypoint "" to override the image's entrypoint. For example:
$ docker run --runtime=nvidia -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash

Yes, removing /install_ngc_cli.sh from the container's entrypoint does prevent the crash, though install_ngc_cli.sh itself would still fail if run.

Do I understand correctly that this failure should be ignored?

Yes. For the root cause and solution, please refer to reply #2 by Morganh in the topic "Chmod: cannot access '/opt/ngccli/ngc': No such file or directory".

much appreciated

apologies for the duplicate
