The “TAO Toolkit for Computer Vision” crashes on startup due to a bad filepath.
Can someone confirm whether this bug report is correct and whether it is being tracked?
Problem
Both of these images fail the same way on startup:
nvcr.io/nvidia/tao/tao-toolkit-tf v3.22.05-tf1.15.5-py3 b85103564252
nvcr.io/nvidia/tao/tao-toolkit-tf v3.22.05-tf1.15.4-py3 ca92a571a959
$ docker run --network=host -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
--2022-11-16 15:59:43-- https://ngc.nvidia.com/downloads/ngccli_reg_linux.zip
Resolving ngc.nvidia.com (ngc.nvidia.com)... 99.84.208.14, 99.84.208.8, 99.84.208.59, ...
Connecting to ngc.nvidia.com (ngc.nvidia.com)|99.84.208.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39221496 (37M) [application/zip]
Saving to: ‘/opt/ngccli/ngccli_reg_linux.zip’
ngccli_reg_linux.zip 100%[============>] 37.40M 15.7MB/s in 2.4s
2022-11-16 15:59:46 (15.7 MB/s) - ‘/opt/ngccli/ngccli_reg_linux.zip’ saved [39221496/39221496]
Archive: /opt/ngccli/ngccli_reg_linux.zip
...
chmod: cannot access '/opt/ngccli/ngc': No such file or directory
$ echo $?
1
Cause
This is due to line 23 of /install_ngc_cli.sh, chmod u+x /opt/ngccli/ngc, because ngc is actually extracted to /opt/ngccli/ngc-cli/ngc, and not /opt/ngccli/ngc.
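For illustration, here is a minimal sketch of how the chmod step could tolerate both archive layouts. The function name make_ngc_executable is hypothetical and is not taken from /install_ngc_cli.sh; the only thing assumed about the archive is that ngc ends up at either <root>/ngc or <root>/ngc-cli/ngc.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the real install script):
# find the extracted ngc binary under an install root and make it
# executable, whichever of the two layouts the archive used.
make_ngc_executable() {
    local root="$1"
    local candidate
    for candidate in "$root/ngc" "$root/ngc-cli/ngc"; do
        if [ -f "$candidate" ]; then
            chmod u+x "$candidate"
            # Print the path that was found so callers can use it.
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "ngc binary not found under $root" >&2
    return 1
}
```

Calling make_ngc_executable /opt/ngccli in place of the hard-coded chmod would fail only if neither path contains a regular file.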
I observed this by overriding the entrypoint (normally /install_ngc_cli.sh) with /bin/bash, then running the script manually and inspecting the filesystem:
$ docker run --network=host -it --rm --entrypoint "/bin/bash" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
root@$HOST:/workspace# /install_ngc_cli.sh
...
chmod: cannot access '/opt/ngccli/ngc': No such file or directory
root@$HOST:/workspace# find / -type f -name ngc
/opt/ngccli/ngc-cli/ngc
Fix
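As a workaround, one can build a derived image that creates /opt/ngccli/ngc as a symlink to the real binary location, runs the install script at build time, and replaces the entrypoint: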
# file: /tmp/tao_fix/dockerfile
FROM nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
RUN mkdir /opt/ngccli && ln -s -T /opt/ngccli/ngc-cli/ngc /opt/ngccli/ngc && /install_ngc_cli.sh
ENTRYPOINT [ "/bin/bash" ]
$ docker build --ssh default --network=host -t tao_fix -f /tmp/tao_fix/dockerfile /tmp/tao_fix
Sending build context to Docker daemon 14.85kB
Step 1/3 : FROM nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
---> b85103564252
Step 2/3 : RUN mkdir /opt/ngccli && ln -s -T /opt/ngccli/ngc-cli/ngc /opt/ngccli/ngc && /install_ngc_cli.sh
---> Using cache
---> a435a881da46
Step 3/3 : ENTRYPOINT [ "/bin/bash" ]
---> Running in bbff6b43ec6f
Removing intermediate container bbff6b43ec6f
---> 6755508ae805
Successfully built 6755508ae805
Successfully tagged tao_fix:latest
$ docker run --network=host -it --rm tao_fix
root@h42-ausl-wk19:/workspace# /opt/ngccli/ngc -h
usage: ngc [--debug] [--format_type <fmt>] [--version] [-h] {config,diag,pym,registry,version} ...
NVIDIA NGC CLI
optional arguments:
-h, --help Show this help message and exit.
--debug Enable debug mode.
--format_type <fmt> Specify the output format type. Supported formats are: ascii, csv, json. Only commands that produce tabular data support csv format. Default: ascii
--version Show the CLI version and exit.
ngc:
{config,diag,pym,registry,version}
config Configuration Commands
diag Diagnostic Commands
pym PyM Commands
registry Registry Commands
version Version Commands
root@h42-ausl-wk19:/workspace#
root@$HOST:/workspace# exit
$ echo $?
0
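The symlink step from the workaround can also be written so that it is safe to re-run. This is only a sketch under assumptions: link_ngc is a hypothetical name, and it uses mkdir -p and ln -sfn where the Dockerfile uses mkdir and ln -s -T, so that an existing directory or an existing link does not cause a failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create <root>/ngc as a symlink pointing at
# <root>/ngc-cli/ngc. The target does not need to exist yet, so this
# can run before the archive is extracted.
link_ngc() {
    local root="$1"
    # -p: do not fail if the directory is already there.
    mkdir -p "$root"
    # -s symbolic, -f replace an existing link, -n do not follow an
    # existing symlink at the link name.
    ln -sfn "$root/ngc-cli/ngc" "$root/ngc"
}
```

With this, the RUN line would call link_ngc /opt/ngccli before /install_ngc_cli.sh, and running it a second time leaves the same link in place.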
Requested Info
• Hardware: (RTX 3070 & A6000)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): N/A
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): N/A
• Training spec file (if available, please share here): N/A
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.): see above