TAO setup with fine-tuning of PCN

Hello TAO team,

I’m trying to fine-tune the Pose Classification Net using the TAO Toolkit PyTorch container, but I keep running into two sets of problems:

  1. When I invoke tao inside the container, I get:

/opt/nvidia/nvidia_entrypoint.sh: line 55: exec: tao: not found

  2. When I instead call the CLI locally (pose_classification) rather than through the container, I hit missing-module errors (see the quick dependency check below):


ModuleNotFoundError: No module named 'hydra'
ModuleNotFoundError: No module named 'eff'
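
For reference, this is the quick dependency check I run in the pcn env; I'm assuming the missing modules map to the hydra-core and nvidia-eff packages on PyPI (those names are my guess):

# Fails with ModuleNotFoundError if either package is missing
conda activate pcn
python -c "import hydra, eff" && echo "hydra and eff import OK"
pip list | grep -iE "hydra|eff|tao"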

Below is a reproducible recipe and full environment dump.


1. Hardware

  • GPU: NVIDIA GeForce RTX 3090 (Driver 575.51.03, CUDA 12.9)
  • CPU: Intel i9-12900K (16 cores / 24 threads)
  • RAM: 125 GiB
  • OS: Ubuntu 24.04.1 LTS (Linux 6.8.0-59-generic)

2. Software

  • Docker: 28.1.1
  • nvidia-container-toolkit: 1.11.0
  • NGC CLI: 3.64.4
  • TAO Toolkit container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
  • Local TAO Launcher: installed via PyPI (tao-cli) in conda pcn env
  • Python (conda env “pcn”): 3.8.20
    • nvidia-tao-pytorch==5.1.0
    • torch==2.1.2+cu118
    • numpy==1.24.4

3. TAO Toolkit Version

$ docker run --rm nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt --version
# (Note: container prints “TAO Toolkit Version 5.5.0”) 
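
To confirm what the image itself ships (independent of the launcher), I also listed the installed Python packages while bypassing the entrypoint script; the package names in the grep are assumptions carried over from my conda env:

# Skip /opt/nvidia/nvidia_entrypoint.sh and query pip inside the image
docker run --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
  pip list | grep -iE "nvidia-tao|hydra|eff"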

4. Experiment Spec

# /home/developer/tao/specs/experiment_pcn.yaml
results_dir: "/results/nvidia"
encryption_key: nvidia_tao

model:
  model_type: ST-GCN
  # …

dataset:
  train_dataset:
    data_path: "/data/nvidia/train_data.npy"
    label_path: "/data/nvidia/train_label.pkl"
  # …

train:
  optim:
    lr: 0.1
    # …

dataset_convert:
  pose_type: "3dbp"
  num_joints: 34
  # …
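
Before training I sanity-check that the spec parses as YAML and that the dataset files referenced above exist on the host side of my mounts (PyYAML assumed to be installed; the host paths are mine and may not match yours):

# Parse the spec and list the files that get mounted into the container
python -c "import yaml; yaml.safe_load(open('/home/developer/tao/specs/experiment_pcn.yaml')); print('spec parses OK')"
ls -lh ~/tao/data/nvidia/train_data.npy ~/tao/data/nvidia/train_label.pkl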

5. Steps to Reproduce

a) Download pretrained checkpoint

export KEY=<YOUR_NGC_API_KEY>
mkdir -p ~/tao/results/pretrained
ngc registry model download-version \
  nvidia/tao/poseclassificationnet:trainable_v1.0 \
  --dest ~/tao/results/pretrained
export PRETRAIN=~/tao/results/pretrained/poseclassificationnet_vtrainable_v1.0/model.step-049.tlt
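
Quick sanity check on my side that the download landed where $PRETRAIN points:

# The checkpoint should show up here after the ngc download
ls -lh ~/tao/results/pretrained/poseclassificationnet_vtrainable_v1.0/
ls -lh "$PRETRAIN"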

b) Local tao launcher (conda pcn env)

export SPECS_DIR=~/tao/specs
export RESULTS_DIR=~/tao/results

tao model pose_classification train \
  -e "$SPECS_DIR/experiment_pcn.yaml" \
  -r "$RESULTS_DIR/finetune_run" \
  -k "$KEY" \
  --gpus 1 \
  resume_training_checkpoint_path="$PRETRAIN"

Error:

usage: train.py [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE]
                [--run] [--multirun] [--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME]
                [--config-dir CONFIG_DIR] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
                [overrides [overrides ...]]
train.py: error: unrecognized arguments: --gpus 1 resume_training_checkpoint_path=
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: 'str' object has no attribute 'decode'
Execution status: FAIL
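
Since the usage message above lists a --cfg flag, I also tried dumping the composed config to see which override keys train.py actually expects; I'm not certain the launcher forwards this flag unchanged, so treat it as an experiment:

# Ask hydra to print the composed job config instead of training
tao model pose_classification train \
  -e "$SPECS_DIR/experiment_pcn.yaml" \
  --cfg job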

c) Container approach still fails to locate the tao binary

docker run --gpus all --rm \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -e KEY="$KEY" \
  -v ~/.ngc:/root/.ngc \
  -v ~/tao/data:/workspace/data \
  -v ~/tao/specs:/workspace/specs \
  -v ~/tao/results:/workspace/results \
  nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
  tao model pose_classification train \
    -e /workspace/specs/experiment_pcn.yaml \
    -r /workspace/results/finetune_run \
    -k $KEY --gpus 1 \
    resume_training_checkpoint_path=/workspace/results/pretrained/poseclassificationnet_vtrainable_v1.0/model.step-049.tlt
/opt/nvidia/nvidia_entrypoint.sh: line 55: exec: tao: not found

6. Questions

  1. Local CLI: Why is train.py treating --gpus and resume_training_checkpoint_path= as unknown args?
  2. Container: Where is the tao entrypoint inside 5.5.0-pyt, and how can I invoke it properly? (See the probe I ran after this list.)
  3. Dependencies: What’s the correct way to install/configure hydra, eff, and the encryption helpers for Pose Classification?
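
For what it's worth, here is how I probed for the CLI entrypoints inside the image (bypassing the entrypoint script; assumes bash and which are available in the container):

# Look for `tao` and the task entrypoints on the image's PATH
docker run --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
  bash -c 'which tao; which pose_classification; ls /usr/local/bin | grep -i pose'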

Thank you for your guidance!

7. Full Environment Dump

(full output attached as a text file; abridged output of the diagnostic script shown below)

$ ./dump_env.sh
=== OS & Kernel ===
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble
Linux 6.8.0-59-generic

=== CPU ===
…  

=== RAM ===
…  

=== Disk usage, lsblk, lspci, nvidia-smi, docker --version, nvidia-container-cli info …

=== TAO Launcher & NGC CLI ===
tao: usage: …  (no `info --verbose` subcommand)
NGC CLI 3.64.4

=== Conda Environments & Python ===
Python 3.8.20  
conda envs:  
  base, mynatek-yolo, pcn (*), pcn-env, tao_env, …

You do not need to install the tao launcher when you run docker run xxx; you can ignore tao-launcher.

After you run docker run xxx, you can run the command without "tao model" at the beginning. For example:

pose_classification train xxx
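
A minimal sketch of that, reusing the mounts from step 5c (the -e/-r/-k flags are carried over from the original command and the spec already sets results_dir and encryption_key, so double-check them against the pose_classification docs; the checkpoint override is left out since its exact key is what question 1 asks about):

docker run --gpus all --rm -it \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/tao/data:/workspace/data \
  -v ~/tao/specs:/workspace/specs \
  -v ~/tao/results:/workspace/results \
  nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
  pose_classification train \
    -e /workspace/specs/experiment_pcn.yaml \
    -r /workspace/results/finetune_run \
    -k "$KEY"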