Hello TAO team,
I’m trying to fine-tune the Pose Classification Net using the TAO Toolkit PyTorch container, but I keep running into two sets of problems:
- When I invoke `tao` inside the container, I get:
/opt/nvidia/nvidia_entrypoint.sh: line 55: exec: tao: not found
- When I manage to call the local launcher CLI (`pose_classification`) instead of the container, I hit missing-module errors:
ModuleNotFoundError: No module named 'hydra'
ModuleNotFoundError: No module named 'eff'
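To double-check which of these packages the interpreter in the active env can actually see, I can run something like the following. This is only a minimal sketch: `missing_modules` is a helper name I made up, and it only tests whether a top-level module can be found, not whether its version is compatible with TAO.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the current interpreter cannot find."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially installed packages
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # 'hydra' and 'eff' are the two modules named in the errors above
    print("missing:", missing_modules(["hydra", "eff"]))
```

Running it with the same interpreter that the launcher uses should show whether the problem is the environment itself or the way the CLI is invoked.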
Below is a reproducible recipe and full environment dump.
1. Hardware
- GPU: NVIDIA GeForce RTX 3090 (Driver 575.51.03, CUDA 12.9)
- CPU: Intel i9-12900K (16 cores / 24 threads)
- RAM: 125 GiB
- OS: Ubuntu 24.04.1 LTS (Linux 6.8.0-59-generic)
2. Software
- Docker: 28.1.1
- nvidia-container-toolkit: 1.11.0
- NGC CLI: 3.64.4
- TAO Toolkit container: `nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt`
- Local TAO Launcher: installed via PyPI (`tao-cli`) in conda `pcn` env
- Python (conda env “pcn”): 3.8.20
  - `nvidia-tao-pytorch==5.1.0`
  - `torch==2.1.2+cu118`
  - `numpy==1.24.4`
3. TAO Toolkit Version
$ docker run --rm nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt --version
# (Note: container prints “TAO Toolkit Version 5.5.0”)
4. Experiment Spec
# /home/developer/tao/specs/experiment_pcn.yaml
results_dir: "/results/nvidia"
encryption_key: nvidia_tao
model:
  model_type: ST-GCN
  # …
dataset:
  train_dataset:
    data_path: "/data/nvidia/train_data.npy"
    label_path: "/data/nvidia/train_label.pkl"
  # …
train:
  optim:
    lr: 0.1
  # …
dataset_convert:
  pose_type: "3dbp"
  num_joints: 34
  # …
5. Steps to Reproduce
a) Download pretrained checkpoint
export KEY=<YOUR_NGC_API_KEY>
mkdir -p ~/tao/results/pretrained
ngc registry model download-version \
nvidia/tao/poseclassificationnet:trainable_v1.0 \
--dest ~/tao/results/pretrained
export PRETRAIN=~/tao/results/pretrained/poseclassificationnet_vtrainable_v1.0/model.step-049.tlt
b) Local tao launcher (conda pcn env)
export SPECS_DIR=~/tao/specs
export RESULTS_DIR=~/tao/results
tao model pose_classification train \
-e "$SPECS_DIR/experiment_pcn.yaml" \
-r "$RESULTS_DIR/finetune_run" \
-k "$KEY" \
--gpus 1 \
resume_training_checkpoint_path="$PRETRAIN"
Error:
usage: train.py [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE]
[--run] [--multirun] [--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME]
[--config-dir CONFIG_DIR] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
[overrides [overrides ...]]
train.py: error: unrecognized arguments: --gpus 1 resume_training_checkpoint_path=
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: 'str' object has no attribute 'decode'
Execution status: FAIL
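My working guess is that the launcher forwards everything to a Hydra-style `train.py`, whose usage line above lists no `--gpus` flag. The sketch below uses a stand-in `argparse` parser that I modeled on that usage line (it is not TAO's actual code) to show how an unknown flag ends up in the "unrecognized arguments" list. The checkpoint path in the override is a shortened placeholder.

```python
import argparse


def build_parser():
    """Stand-in parser loosely modeled on the usage line printed by train.py (assumption)."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--config-path")
    parser.add_argument("--config-name")
    # Hydra-style key=value overrides are collected as positional arguments
    parser.add_argument("overrides", nargs="*")
    return parser


def unrecognized(argv):
    """Return the tokens the stand-in parser does not recognize."""
    _, extras = build_parser().parse_known_args(argv)
    return extras


if __name__ == "__main__":
    argv = ["--gpus", "1", "resume_training_checkpoint_path=/path/to/model.tlt"]
    print("unrecognized:", unrecognized(argv))
```

If that guess is right, the real question is where `--gpus` is supposed to be consumed before `train.py` sees the arguments.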
c) The container approach still fails to locate the tao binary
docker run --gpus all --rm \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-e KEY="$KEY" \
-v ~/.ngc:/root/.ngc \
-v ~/tao/data:/workspace/data \
-v ~/tao/specs:/workspace/specs \
-v ~/tao/results:/workspace/results \
nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
tao model pose_classification train \
-e /workspace/specs/experiment_pcn.yaml \
-r /workspace/results/finetune_run \
-k $KEY --gpus 1 \
resume_training_checkpoint_path=/workspace/results/pretrained/poseclassificationnet_vtrainable_v1.0/model.step-049.tlt
Error:
/opt/nvidia/nvidia_entrypoint.sh: line 55: exec: tao: not found
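While preparing this I also noticed that the spec refers to `/results/nvidia` and `/data/nvidia/...`, whereas the `docker run` above mounts my folders under `/workspace/...`. That is unrelated to the `tao: not found` error, but a small check like this would flag the mismatch. The helper name `paths_outside_mounts` is my own, and the two lists are copied from the spec and the `-v` flags in this post.

```python
from pathlib import PurePosixPath


def paths_outside_mounts(spec_paths, mount_targets):
    """Return spec paths that are not located under any container mount target."""
    mounts = [PurePosixPath(m) for m in mount_targets]
    outside = []
    for raw in spec_paths:
        path = PurePosixPath(raw)
        # A path is covered if it equals a mount target or has one as an ancestor
        covered = any(path == m or m in path.parents for m in mounts)
        if not covered:
            outside.append(raw)
    return outside


if __name__ == "__main__":
    # Paths taken from experiment_pcn.yaml
    spec_paths = [
        "/results/nvidia",
        "/data/nvidia/train_data.npy",
        "/data/nvidia/train_label.pkl",
    ]
    # Container-side targets of the -v flags in the docker run command
    mount_targets = [
        "/root/.ngc",
        "/workspace/data",
        "/workspace/specs",
        "/workspace/results",
    ]
    for p in paths_outside_mounts(spec_paths, mount_targets):
        print("not under any mount:", p)
```

So once the entrypoint issue is solved, I will probably need to either change the spec paths or adjust the mounts.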
6. Questions
- Local CLI: Why is `train.py` treating `--gpus` and `resume_training_checkpoint_path=` as unknown args?
- Container: Where is the `tao` entrypoint inside `5.5.0-pyt`, and how can I invoke it properly?
- Dependencies: What’s the correct way to install/configure hydra, eff, and the encryption helpers for Pose Classification?
Thank you for your guidance!
7. Full Environment Dump
(see attached text file or paste the output of the diagnostic script)
$ ./dump_env.sh
=== OS & Kernel ===
Distributor ID: Ubuntu
Description: Ubuntu 24.04.1 LTS
Release: 24.04
Codename: noble
Linux 6.8.0-59-generic
=== CPU ===
…
=== RAM ===
…
=== Disk usage, lsblk, lspci, nvidia-smi, docker --version, nvidia-container-cli info …
=== TAO Launcher & NGC CLI ===
tao: usage: … (no `info --verbose` subcommand)
NGC CLI 3.64.4
=== Conda Environments & Python ===
Python 3.8.20
conda envs:
base, mynatek-yolo, pcn (*), pcn-env, tao_env, …