TAO setup with fine-tuning of PCN

Hello TAO team,

I’m trying to fine-tune the Pose Classification Net using the TAO Toolkit PyTorch container, but I keep running into two sets of problems:

  1. When I invoke tao inside the container, I get:

/opt/nvidia/nvidia_entrypoint.sh: line 55: exec: tao: not found

  2. When I call the local launcher CLI (pose_classification) instead of the container, I hit missing-module errors:

ModuleNotFoundError: No module named 'hydra'
ModuleNotFoundError: No module named 'eff'

Below is a reproducible recipe and full environment dump.


1. Hardware

  • GPU: NVIDIA GeForce RTX 3090 (Driver 575.51.03, CUDA 12.9)
  • CPU: Intel i9-12900K (16 cores / 24 threads)
  • RAM: 125 GiB
  • OS: Ubuntu 24.04.1 LTS (Linux 6.8.0-59-generic)

2. Software

  • Docker: 28.1.1
  • nvidia-container-toolkit: 1.11.0
  • NGC CLI: 3.64.4
  • TAO Toolkit container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
  • Local TAO Launcher: installed via PyPI (tao-cli) in conda pcn env
  • Python (conda env “pcn”): 3.8.20
    • nvidia-tao-pytorch==5.1.0
    • torch==2.1.2+cu118
    • numpy==1.24.4

3. TAO Toolkit Version

$ docker run --rm nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt --version
# (Note: container prints “TAO Toolkit Version 5.5.0”) 

4. Experiment Spec

# /home/developer/tao/specs/experiment_pcn.yaml
results_dir: "/results/nvidia"
encryption_key: nvidia_tao

model:
  model_type: ST-GCN
  # …

dataset:
  train_dataset:
    data_path: "/data/nvidia/train_data.npy"
    label_path: "/data/nvidia/train_label.pkl"
  # …

train:
  optim:
    lr: 0.1
    # …

dataset_convert:
  pose_type: "3dbp"
  num_joints: 34
  # …
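
The spec uses absolute paths (`/results/nvidia`, `/data/nvidia/...`), and when training runs inside a container those paths have to exist there. As a rough sketch, the check below lists double-quoted absolute paths in a spec that do not fall under any of the given directory prefixes. The function name `unmounted_paths` is hypothetical and not part of TAO, and the sketch assumes a `grep` that supports `-o`.

```shell
# Hypothetical helper (not part of TAO): print every double-quoted absolute
# path in a spec file that is not under one of the given directory prefixes.
# Assumes grep supports -o (GNU and BSD grep do).
unmounted_paths() {
  spec=$1
  shift
  grep -oE '"/[^"]+"' "$spec" | tr -d '"' | while read -r p; do
    covered=no
    for prefix in "$@"; do
      case $p in
        "$prefix"/*) covered=yes ;;
      esac
    done
    [ "$covered" = yes ] || echo "$p"
  done
}
```

A possible call, using the mount targets from step 5c, would be `unmounted_paths ~/tao/specs/experiment_pcn.yaml /workspace/data /workspace/specs /workspace/results`. Any path it prints is one the container would not see under those mounts.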

5. Steps to Reproduce

a) Download pretrained checkpoint

export KEY=<YOUR_NGC_API_KEY>
mkdir -p ~/tao/results/pretrained
ngc registry model download-version \
  nvidia/tao/poseclassificationnet:trainable_v1.0 \
  --dest ~/tao/results/pretrained
export PRETRAIN=~/tao/results/pretrained/poseclassificationnet_vtrainable_v1.0/model.step-049.tlt
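
Two different checkpoint filenames appear in this thread (`model.step-049.tlt` here and `st-gcn_3dbp_nvidia.tlt` in a later command), so it may be worth confirming what the download actually produced before exporting `PRETRAIN`. A minimal sketch, assuming the download location from step (a); `list_checkpoints` is a hypothetical helper name:

```shell
# Hypothetical helper: list .tlt checkpoints under a directory, one per line.
list_checkpoints() {
  find "$1" -type f -name '*.tlt' 2>/dev/null | sort
}

# Assumed download location from step (a); adjust if --dest was different.
PRETRAIN=$(list_checkpoints "$HOME/tao/results/pretrained" | head -n 1)
if [ -n "$PRETRAIN" ]; then
  echo "using checkpoint: $PRETRAIN"
else
  echo "no .tlt file found under $HOME/tao/results/pretrained" >&2
fi
```

Picking the first match is an arbitrary choice for illustration; if several checkpoints are listed, choose the intended one by hand.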

b) Local tao launcher (conda pcn env)

export SPECS_DIR=~/tao/specs
export RESULTS_DIR=~/tao/results

tao model pose_classification train \
  -e "$SPECS_DIR/experiment_pcn.yaml" \
  -r "$RESULTS_DIR/finetune_run" \
  -k "$KEY" \
  --gpus 1 \
  resume_training_checkpoint_path="$PRETRAIN"

Error:

usage: train.py [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE]
                [--run] [--multirun] [--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME]
                [--config-dir CONFIG_DIR] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
                [overrides [overrides ...]]
train.py: error: unrecognized arguments: --gpus 1 resume_training_checkpoint_path=
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: 'str' object has no attribute 'decode'
Execution status: FAIL
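
The usage text above is that of a Hydra application: apart from Hydra's own flags, it only accepts positional overrides, which Hydra writes as `key=value`. That would explain why `--gpus 1` is rejected. The error also echoes `resume_training_checkpoint_path=` with nothing after the `=`, which may mean the value was empty when the command ran. The sketch below is a simplified pre-flight check for both cases. `check_overrides` is a hypothetical helper, and it does not cover the full Hydra override grammar (for example the `~key` deletion form).

```shell
# Hypothetical pre-flight check: flag arguments that are not key=value
# overrides, and overrides whose value is empty. Returns non-zero if any
# argument is flagged. Simplified; not the full Hydra override grammar.
check_overrides() {
  rc=0
  for arg in "$@"; do
    case $arg in
      -*)  echo "not key=value: $arg"; rc=1 ;;
      *=)  echo "empty value: $arg";   rc=1 ;;
      *=*) ;;  # looks like key=value
      *)   echo "not key=value: $arg"; rc=1 ;;
    esac
  done
  return $rc
}
```

For example, `check_overrides --gpus 1 resume_training_checkpoint_path="$PRETRAIN"` flags `--gpus` and `1`, and also flags the checkpoint override if `$PRETRAIN` is empty. Which key, if any, sets the GPU count for this model is not confirmed in this thread.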

c) Container approach (still fails to locate the tao binary)

docker run --gpus all --rm \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -e KEY="$KEY" \
  -v ~/.ngc:/root/.ngc \
  -v ~/tao/data:/workspace/data \
  -v ~/tao/specs:/workspace/specs \
  -v ~/tao/results:/workspace/results \
  nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
  tao model pose_classification train \
    -e /workspace/specs/experiment_pcn.yaml \
    -r /workspace/results/finetune_run \
    -k $KEY --gpus 1 \
    resume_training_checkpoint_path=/workspace/results/pretrained/poseclassificationnet_vtrainable_v1.0/model.step-049.tlt

Error:

/opt/nvidia/nvidia_entrypoint.sh: line 55: exec: tao: not found
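
The message comes from the image's entrypoint script trying to `exec` the first word of the command, so one way to narrow this down is to check which command names actually resolve on `PATH` inside the container. A small sketch; `report_cmds` is a hypothetical helper name:

```shell
# Report which of the given command names resolve on PATH in this shell.
report_cmds() {
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      echo "$name: found"
    else
      echo "$name: not found"
    fi
  done
}

report_cmds tao pose_classification
```

This is meant to be pasted into an interactive shell inside the container, assuming the image lets you start one (for example by passing `bash` as the command with `-it`).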

6. Questions

  1. Local CLI: Why is train.py treating --gpus and resume_training_checkpoint_path= as unknown args?
  2. Container: Where is the tao entrypoint inside 5.5.0-pyt, and how can I invoke it properly?
  3. Dependencies: What’s the correct way to install/configure hydra, eff, and the encryption helpers for Pose Classification?

Thank you for your guidance!

7. Full Environment Dump

(see attached text file or paste the output of the diagnostic script)

$ ./dump_env.sh
=== OS & Kernel ===
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble
Linux 6.8.0-59-generic

=== CPU ===
…  

=== RAM ===
…  

=== Disk usage, lsblk, lspci, nvidia-smi, docker --version, nvidia-container-cli info …

=== TAO Launcher & NGC CLI ===
tao: usage: …  (no `info --verbose` subcommand)
NGC CLI 3.64.4

=== Conda Environments & Python ===
Python 3.8.20  
conda envs:  
  base, mynatek-yolo, pcn (*), pcn-env, tao_env, …

You do not need to install tao-launcher when you start the container yourself with docker run xxx.
That means you can ignore tao-launcher.

Inside the container started by docker run xxx, run the command without tao model at the beginning. For example,
pose_classification train xxx
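
As an illustration of that advice, a launcher-style command line can be turned into the in-container form by dropping the leading `tao model`. This is only a sketch; `strip_launcher_prefix` is a hypothetical helper and not part of TAO:

```shell
# Hypothetical helper: drop a leading "tao model" so the remaining words can
# be used as the command inside the container. Other input is left as-is.
strip_launcher_prefix() {
  if [ "${1-}" = "tao" ] && [ "${2-}" = "model" ]; then
    shift 2
  fi
  printf '%s\n' "$*"
}

strip_launcher_prefix tao model pose_classification train -e /workspace/specs/experiment_pcn.yaml
```

The call above prints the command starting at `pose_classification train`, which is the form to put after the image name in `docker run`.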


Running tao inside Docker causes issues when setting up TAO for PCN

For that container, I ran the following docker command:

docker run --gpus all -it --rm \
  -v $SPECS_DIR:/workspace/specs \
  -v $RESULTS_DIR:/workspace/results \
  nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
  tao model pose_classification train \
    -e /workspace/specs/experiment_pcn.yaml \
    -r /workspace/results/finetune_run \
    -k "56789087654" \
    resume_training_checkpoint_path="/workspace/results/pretrained/poseclassificationnet_vtrainable_v1.0/st-gcn_3dbp_nvidia.tlt"

But it fails, reporting that there is no bash file or Docker file.

The image is up to date, but I still get these errors.

Please delete "tao model" from the command.

Modify it as below.

docker run --gpus all -it --rm \
  -v $SPECS_DIR:/workspace/specs \
  -v $RESULTS_DIR:/workspace/results \
  nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
  pose_classification train \
    -e /workspace/specs/experiment_pcn.yaml \
    -r /workspace/results/finetune_run \
    -k "56789087654" \
    resume_training_checkpoint_path="/workspace/results/pretrained/poseclassificationnet_vtrainable_v1.0/st-gcn_3dbp_nvidia.tlt"
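
When a multi-line command like this is copied from a web page, stray spaces after a trailing backslash are easy to pick up. The shell then no longer treats the backslash as a line continuation, so the command ends early and the next line runs on its own. A quick check, as a sketch, assuming the command has been saved to a file; `find_broken_continuations` is a hypothetical helper name:

```shell
# Print "line-number:line" for every line where a backslash is followed only
# by spaces or tabs, i.e. a line continuation that the shell will not honor.
# Exits non-zero when no such line is found (grep's usual behavior).
find_broken_continuations() {
  grep -nE '\\[[:space:]]+$' "$1"
}
```

For example, `find_broken_continuations run_train.sh` (with `run_train.sh` standing in for wherever the command was saved) prints nothing when every continuation is clean.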

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks
