TAO setup with fine-tuning of PCN

nawazktk99 · May 12, 2025, 1:28pm

Hello TAO team,

I’m trying to fine-tune the Pose Classification Net using the TAO Toolkit PyTorch container, but I keep running into two sets of problems:

When I invoke tao inside the container I get

/opt/nvidia/nvidia_entrypoint.sh: line 55: exec: tao: not found

When I manage to call the local launcher CLI (pose_classification) instead of the container, I hit missing-module errors:

ModuleNotFoundError: No module named 'hydra'
ModuleNotFoundError: No module named 'eff'

Below is a reproducible recipe and full environment dump.

1. Hardware

GPU: NVIDIA GeForce RTX 3090 (Driver 575.51.03, CUDA 12.9)
CPU: Intel i9-12900K (16 cores / 24 threads)
RAM: 125 GiB
OS: Ubuntu 24.04.1 LTS (Linux 6.8.0-59-generic)

2. Software

Docker: 28.1.1
nvidia-container-toolkit: 1.11.0
NGC CLI: 3.64.4
TAO Toolkit container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
Local TAO Launcher: installed via PyPI (tao-cli) in conda pcn env
Python (conda env “pcn”): 3.8.20
- nvidia-tao-pytorch==5.1.0
- torch==2.1.2+cu118
- numpy==1.24.4

3. TAO Toolkit Version

$ docker run --rm nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt --version
# (Note: container prints “TAO Toolkit Version 5.5.0”)

4. Experiment Spec

# /home/developer/tao/specs/experiment_pcn.yaml
results_dir: "/results/nvidia"
encryption_key: nvidia_tao

model:
  model_type: ST-GCN
  # …

dataset:
  train_dataset:
    data_path: "/data/nvidia/train_data.npy"
    label_path: "/data/nvidia/train_label.pkl"
  # …

train:
  optim:
    lr: 0.1
    # …

dataset_convert:
  pose_type: "3dbp"
  num_joints: 34
  # …

5. Steps to Reproduce

a) Download pretrained checkpoint

export KEY=<YOUR_NGC_API_KEY>
mkdir -p ~/tao/results/pretrained
ngc registry model download-version \
  nvidia/tao/poseclassificationnet:trainable_v1.0 \
  --dest ~/tao/results/pretrained
export PRETRAIN=~/tao/results/pretrained/poseclassificationnet_vtrainable_v1.0/model.step-049.tlt

b) Local `tao` launcher (conda `pcn` env)

export SPECS_DIR=~/tao/specs
export RESULTS_DIR=~/tao/results

tao model pose_classification train \
  -e "$SPECS_DIR/experiment_pcn.yaml" \
  -r "$RESULTS_DIR/finetune_run" \
  -k "$KEY" \
  --gpus 1 \
  resume_training_checkpoint_path="$PRETRAIN"

Error:

usage: train.py [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE]
                [--run] [--multirun] [--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME]
                [--config-dir CONFIG_DIR] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
                [overrides [overrides ...]]
train.py: error: unrecognized arguments: --gpus 1 resume_training_checkpoint_path=
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: 'str' object has no attribute 'decode'
Execution status: FAIL

c) Container approach still fails locating the `tao` binary

docker run --gpus all --rm \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -e KEY="$KEY" \
  -v ~/.ngc:/root/.ngc \
  -v ~/tao/data:/workspace/data \
  -v ~/tao/specs:/workspace/specs \
  -v ~/tao/results:/workspace/results \
  nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
  tao model pose_classification train \
    -e /workspace/specs/experiment_pcn.yaml \
    -r /workspace/results/finetune_run \
    -k $KEY --gpus 1 \
    resume_training_checkpoint_path=/workspace/results/pretrained/poseclassificationnet_vtrainable_v1.0/model.step-049.tlt

/opt/nvidia/nvidia_entrypoint.sh: line 55: exec: tao: not found

6. Questions

Local CLI: Why is train.py treating --gpus and resume_training_checkpoint_path= as unknown args?
Container: Where is the tao entrypoint inside 5.5.0-pyt, and how can I invoke it properly?
Dependencies: What’s the correct way to install/configure hydra, eff, and the encryption helpers for Pose Classification?

Thank you for your guidance!

7. Full Environment Dump

(see attached text file or paste the output of the diagnostic script)

$ ./dump_env.sh
=== OS & Kernel ===
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble
Linux 6.8.0-59-generic

=== CPU ===
…  

=== RAM ===
…  

=== Disk usage, lsblk, lspci, nvidia-smi, docker --version, nvidia-container-cli info …

=== TAO Launcher & NGC CLI ===
tao: usage: …  (no `info --verbose` subcommand)
NGC CLI 3.64.4

=== Conda Environments & Python ===
Python 3.8.20  
conda envs:  
  base, mynatek-yolo, pcn (*), pcn-env, tao_env, …

Morganh · May 13, 2025, 3:10am

You do not need to install tao-launcher when you start the container yourself with `docker run xxx`.
That means you can ignore tao-launcher.
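
The `exec: tao: not found` message in the original post fits this: `tao` is the launcher command, and it does not appear to be on the container's PATH. One way to check which commands are actually available is a small helper like the sketch below, run from a shell inside the container. The function name `report_cmds` is made up for illustration; it only relies on the standard `command -v` builtin.

```shell
# Sketch: run from a shell inside the container.
# Reports, for each name given, whether it resolves to a command on PATH.
report_cmds() {
  local name
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      echo "$name: found"
    else
      echo "$name: not found"
    fi
  done
}

# The two names discussed in this thread.
report_cmds tao pose_classification
```

If `pose_classification` is reported as found and `tao` is not, the task command can be called directly.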

After you start the container with `docker run xxx`, run the command without `tao model` at the beginning. For example:
pose_classification train xxx
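
As a rough sketch of this suggestion, the container command from step 5c could be rewritten so that `pose_classification train` is called directly. The image tag, mounts, and spec path are copied from the original post. Passing `results_dir` as a `key=value` override instead of `-r` is an assumption based on the Hydra-style `train.py` usage message quoted above and on the `results_dir` key in the spec; the exact override names for checkpoints and GPU selection are not shown in this thread and should be checked against the TAO documentation for 5.5.0. `TAO_HOME` is a hypothetical variable standing in for `~/tao`. The script only prints the command unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Sketch: call the task entrypoint directly inside the TAO PyTorch container.

IMAGE="nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt"
TAO_HOME="${TAO_HOME:-${HOME:-.}/tao}"   # hypothetical; the post uses ~/tao

cmd=(
  docker run --gpus all --rm
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864
  -v "$TAO_HOME/data:/workspace/data"
  -v "$TAO_HOME/specs:/workspace/specs"
  -v "$TAO_HOME/results:/workspace/results"
  "$IMAGE"
  # No launcher prefix here: the task command is invoked on its own.
  pose_classification train
  -e /workspace/specs/experiment_pcn.yaml
  # Assumption: further options are Hydra overrides (key=value), not flags.
  results_dir=/workspace/results/finetune_run
)

if [ "${RUN:-0}" = "1" ]; then
  "${cmd[@]}"                      # actually start the container
else
  printf '%s ' "${cmd[@]}"; echo   # dry run: show the command only
fi
```

Note that the spec in section 4 refers to `/data/nvidia/...` and `/results/nvidia`, while these mounts expose `/workspace/data` and `/workspace/results`, so the paths in the spec or the mount targets would need to agree before training can find the data.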
