Hello TAO team,
I’m trying to fine-tune the Pose Classification Net using the TAO Toolkit PyTorch container, but I keep running into two sets of problems:
- When I invoke `tao` inside the container, I get:
/opt/nvidia/nvidia_entrypoint.sh: line 55: exec: tao: not found
- When I manage to call the local launcher CLI (`pose_classification`) instead of the container, I hit missing-module errors:
ModuleNotFoundError: No module named 'hydra'
ModuleNotFoundError: No module named 'eff'
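To narrow down the missing-module errors, a small check like the following can show which of the two modules the active Python interpreter can actually see. This is only a diagnostic sketch: `check_module` is a helper name made up for illustration, and it assumes the `python3` on `PATH` is the interpreter from the conda env in question.

```shell
# Hypothetical helper (not part of TAO): report whether a top-level Python
# module is importable by the python3 currently on PATH.
check_module() {
    python3 - "$1" <<'PY'
import importlib.util
import sys

name = sys.argv[1]
found = importlib.util.find_spec(name) is not None
print(f"{name}: {'found' if found else 'MISSING'}")
PY
}

# The two modules named in the errors above.
for mod in hydra eff; do
    check_module "$mod"
done
```

Running it once with the `pcn` env activated and once without should show whether the launcher is picking up a different interpreter than expected.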
Below is a reproducible recipe and full environment dump.
1. Hardware
- GPU: NVIDIA GeForce RTX 3090 (Driver 575.51.03, CUDA 12.9)
- CPU: Intel i9-12900K (16 cores / 24 threads)
- RAM: 125 GiB
- OS: Ubuntu 24.04.1 LTS (Linux 6.8.0-59-generic)
2. Software
- Docker: 28.1.1
- nvidia-container-toolkit: 1.11.0
- NGC CLI: 3.64.4
- TAO Toolkit container:
nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
- Local TAO Launcher: installed via PyPI (`tao-cli`) in conda `pcn` env
- Python (conda env “pcn”): 3.8.20
nvidia-tao-pytorch==5.1.0
torch==2.1.2+cu118
numpy==1.24.4
3. TAO Toolkit Version
$ docker run --rm nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt --version
# (Note: container prints “TAO Toolkit Version 5.5.0”)
4. Experiment Spec
# /home/developer/tao/specs/experiment_pcn.yaml
results_dir: "/results/nvidia"
encryption_key: nvidia_tao
model:
  model_type: ST-GCN
  # …
dataset:
  train_dataset:
    data_path: "/data/nvidia/train_data.npy"
    label_path: "/data/nvidia/train_label.pkl"
  # …
train:
  optim:
    lr: 0.1
  # …
dataset_convert:
  pose_type: "3dbp"
  num_joints: 34
  # …
5. Steps to Reproduce
a) Download pretrained checkpoint
export KEY=<YOUR_NGC_API_KEY>
mkdir -p ~/tao/results/pretrained
ngc registry model download-version \
nvidia/tao/poseclassificationnet:trainable_v1.0 \
--dest ~/tao/results/pretrained
export PRETRAIN=~/tao/results/pretrained/poseclassificationnet_vtrainable_v1.0/model.step-049.tlt
b) Local `tao` launcher (conda `pcn` env)
export SPECS_DIR=~/tao/specs
export RESULTS_DIR=~/tao/results
tao model pose_classification train \
-e "$SPECS_DIR/experiment_pcn.yaml" \
-r "$RESULTS_DIR/finetune_run" \
-k "$KEY" \
--gpus 1 \
resume_training_checkpoint_path="$PRETRAIN"
Error:
usage: train.py [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE]
[--run] [--multirun] [--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME]
[--config-dir CONFIG_DIR] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
[overrides [overrides ...]]
train.py: error: unrecognized arguments: --gpus 1 resume_training_checkpoint_path=
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: 'str' object has no attribute 'decode'
Execution status: FAIL
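The error echoes `resume_training_checkpoint_path=` with nothing after the `=`. One possible explanation (an assumption, since the launcher may simply truncate the value when printing) is that `$PRETRAIN` was empty in the shell that ran the command. A guard like this sketch would rule that out before launching; `require_file` is a hypothetical helper, not a TAO command.

```shell
# Hypothetical helper: fail early if the given path is empty or does not
# point at an existing regular file. $1 is a label for messages, $2 the path.
require_file() {
    label=$1
    file_path=$2
    if [ -z "$file_path" ]; then
        echo "$label is unset or empty" >&2
        return 1
    fi
    if [ ! -f "$file_path" ]; then
        echo "$label points to a missing file: $file_path" >&2
        return 1
    fi
    echo "$label OK: $file_path"
}

# Check the checkpoint variable from step a) before starting training.
require_file PRETRAIN "${PRETRAIN:-}" || echo "fix PRETRAIN before launching training"
```

This does not address why `--gpus 1` is rejected; it only confirms that the checkpoint path reaching the command line is non-empty and valid.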
c) Container approach still fails locating the `tao` binary
docker run --gpus all --rm \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-e KEY="$KEY" \
-v ~/.ngc:/root/.ngc \
-v ~/tao/data:/workspace/data \
-v ~/tao/specs:/workspace/specs \
-v ~/tao/results:/workspace/results \
nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
tao model pose_classification train \
-e /workspace/specs/experiment_pcn.yaml \
-r /workspace/results/finetune_run \
-k $KEY --gpus 1 \
resume_training_checkpoint_path=/workspace/results/pretrained/poseclassificationnet_vtrainable_v1.0/model.step-049.tlt
/opt/nvidia/nvidia_entrypoint.sh: line 55: exec: tao: not found
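Since the entrypoint script reports `tao: not found`, it may help to list which candidate commands actually exist on the container's `PATH`. The sketch below is a diagnostic built on assumptions: `which_cli` is a hypothetical helper, and the idea that a task-named command such as `pose_classification` might be present in the image is a guess based on the local CLI name mentioned earlier, not something confirmed here.

```shell
# Hypothetical helper: for each name given, print where it resolves on PATH,
# or "not found" if the shell cannot locate it.
which_cli() {
    for name in "$@"; do
        if resolved=$(command -v "$name" 2>/dev/null); then
            echo "$name -> $resolved"
        else
            echo "$name -> not found"
        fi
    done
}

# Run on the host (inside the conda env) for comparison:
which_cli tao pose_classification

# To run the same check inside the image, an equivalent loop can be passed
# to a shell there (the image tag is the one used above):
#   docker run --rm nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
#     sh -c 'for n in tao pose_classification; do command -v "$n" || echo "$n -> not found"; done'
```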
6. Questions
- Local CLI: Why is `train.py` treating `--gpus` and `resume_training_checkpoint_path=` as unknown args?
- Container: Where is the `tao` entrypoint inside `5.5.0-pyt`, and how can I invoke it properly?
- Dependencies: What’s the correct way to install/configure hydra, eff, and the encryption helpers for Pose Classification?
Thank you for your guidance!
7. Full Environment Dump
(see attached text file or paste the output of the diagnostic script)
$ ./dump_env.sh
=== OS & Kernel ===
Distributor ID: Ubuntu
Description: Ubuntu 24.04.1 LTS
Release: 24.04
Codename: noble
Linux 6.8.0-59-generic
=== CPU ===
…
=== RAM ===
…
=== Disk usage, lsblk, lspci, nvidia-smi, docker --version, nvidia-container-cli info …
=== TAO Launcher & NGC CLI ===
tao: usage: … (no `info --verbose` subcommand)
NGC CLI 3.64.4
=== Conda Environments & Python ===
Python 3.8.20
conda envs:
base, mynatek-yolo, pcn (*), pcn-env, tao_env, …