Issue Running Inference on NVIDIA TAO Retail Object Recognition Model

Hello,

I’m trying to evaluate the Retail Object Recognition model from NVIDIA TAO to see if it fits my needs. My goal is to run inference using the pretrained model, but I’ve encountered issues along the way. My hardware is a single NVIDIA A100.

Steps Taken:

I followed the official tutorial from NVIDIA: Retail Object Recognition Notebook

However, the notebook is primarily focused on transfer learning, and I couldn’t find clear instructions on how to directly test the pretrained model.

I downloaded the model using:

!ngc registry model download-version nvidia/tao/retail_object_recognition:trainable_head_fan_base_v2.0 --dest $HOST_MODEL_DIR/

I modified the infer.yaml file as follows:

results_dir: "???"
model:
  backbone: fan_base
  input_width: 224
  input_height: 224
  feat_dim: 1024
dataset:
  workers: 8
  val_dataset:
    reference: "???"
    query: ""
inference:
  inference_input_type: classification_folder
  input_path: "???"
  batch_size: 16
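
The "???" values in this spec are placeholders that must be filled in, here through the command-line overrides. A minimal sketch for listing which fields are still unset, assuming the spec has already been parsed into nested dicts (for example with a YAML loader); `find_missing` is a hypothetical helper, not part of TAO:

```python
def find_missing(spec, prefix=""):
    """Return dotted paths of all values still set to the '???' placeholder."""
    missing = []
    for key, value in spec.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections such as dataset.val_dataset.
            missing.extend(find_missing(value, path))
        elif value == "???":
            missing.append(path)
    return missing


# The infer.yaml above, written out as a nested dict.
spec = {
    "results_dir": "???",
    "model": {"backbone": "fan_base", "input_width": 224,
              "input_height": 224, "feat_dim": 1024},
    "dataset": {"workers": 8,
                "val_dataset": {"reference": "???", "query": ""}},
    "inference": {"inference_input_type": "classification_folder",
                  "input_path": "???", "batch_size": 16},
}
print(find_missing(spec))
```

The paths it reports are the ones that need a `key=value` override on the command line.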

I attempted to run inference with:

# run inference on known classes
! tao model ml_recog inference \
                    -e $SPECS_DIR/infer.yaml \
                    results_dir=$RESULTS_DIR \
                    inference.checkpoint=$MODEL_DIR/retail_object_recognition_vtrainable_head_fan_base_v2.0/retail_object_recognition_head_fan_base_v2.0.pth \
                    dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/known_classes/reference \
                    inference.input_path=$DATA_DIR/$DATA_FOLDER/known_classes/test 

Encountered Errors:

I received the following error:
KeyError: 'pytorch-lightning_version'

It seems that the checkpoint file lacks the required pytorch-lightning_version key. I attempted to manually modify the checkpoint by loading it in PyTorch and adding:

new_ckpt['pytorch-lightning_version'] = '0.0.0'
new_ckpt['global_step'] = None
new_ckpt['epoch'] = None
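
Before patching keys into the file, it can help to look at what it actually contains. A minimal sketch, assuming the .pth file unpickles to a plain dict; `summarize_checkpoint` is a hypothetical helper name, not part of TAO or PyTorch Lightning, and the file path is passed on the command line:

```python
from collections import Counter


def summarize_checkpoint(ckpt: dict) -> dict:
    """Report whether a loaded checkpoint dict looks like a Lightning checkpoint."""
    # Lightning checkpoints nest the weights under "state_dict"; a raw
    # weights file is usually the state_dict itself.
    weights = ckpt.get("state_dict", ckpt)
    # Count parameter names by their first dotted component.
    prefixes = Counter(str(name).split(".")[0] for name in weights)
    return {
        "has_lightning_version": "pytorch-lightning_version" in ckpt,
        "has_state_dict": "state_dict" in ckpt,
        "top_level_prefixes": dict(prefixes),
    }


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        import torch  # imported lazily so the helper itself needs no torch

        loaded = torch.load(sys.argv[1], map_location="cpu")
        print(summarize_checkpoint(loaded))
```

If neither `pytorch-lightning_version` nor `state_dict` is present, the file is probably a bare set of weights rather than a full training checkpoint.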

However, this did not resolve the issue. The model’s state_dict is missing several keys, and adding them manually does not work. The error I receive is:

RuntimeError: Error(s) in loading state_dict for MLRecogModel:
Missing key(s) in state_dict: "model.embedder.classifier_feat.0.weight", "model.embedder.classifier_feat.0.bias", ... etc...
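
The message lists parameter names the model expects but the file does not provide. A small sketch of how such a mismatch could be narrowed down using only the key names; `diff_keys` is a hypothetical helper, and the names on the `actual` side are invented for illustration, not read from the real file:

```python
def diff_keys(expected, actual, strip_prefix=""):
    """Compare expected parameter names with those found in a checkpoint.

    strip_prefix is removed from the expected names first, which helps spot
    files that hold the same weights under shorter names.
    """
    def strip(name):
        if strip_prefix and name.startswith(strip_prefix):
            return name[len(strip_prefix):]
        return name

    expected_set = {strip(name) for name in expected}
    actual_set = set(actual)
    return {
        "missing": sorted(expected_set - actual_set),
        "unexpected": sorted(actual_set - expected_set),
    }


expected = [
    "model.embedder.classifier_feat.0.weight",
    "model.embedder.classifier_feat.0.bias",
]
actual = ["classifier_feat.0.weight", "classifier_feat.0.bias"]  # illustrative only

print(diff_keys(expected, actual))
print(diff_keys(expected, actual, strip_prefix="model.embedder."))
```

If the mismatch disappears once a prefix is stripped, the file likely holds weights for one submodule only and not for the whole model.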

Is this model intended only for fine-tuning, or should it work for direct inference? I do not want to train yet; I just want to check whether it fits my needs before fine-tuning it.

Thank you in advance.

Could you please check if https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/retail_object_recognition/files?version=trainable_v1.0 works? Thanks.

Thanks for the answer,

I downloaded that model and tried it, but without any luck:

# run inference on known classes
! tao model ml_recog inference \
                    -e $SPECS_DIR/infer.yaml \
                    results_dir=$RESULTS_DIR \
                    inference.checkpoint=$MODEL_DIR/retail_object_recognition.pth \
                    dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/known_classes/reference \
                    inference.input_path=$DATA_DIR/$DATA_FOLDER/known_classes/test 

This is my output log. I also tried adding pytorch-lightning_version and the other keys, but the state_dict keys are still missing.

2025-02-19 10:50:17,230 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-02-19 10:50:17,318 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-02-19 10:50:17,334 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/gmdo/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-02-19 10:50:17,334 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-02-19 10:50:24,921 - TAO Toolkit - faiss.loader - INFO] Loading faiss with AVX2 support.
[2025-02-19 10:50:24,942 - TAO Toolkit - faiss.loader - INFO] Successfully loaded faiss with AVX2 support.
[2025-02-19 10:50:25,813 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
INFO: Loading faiss with AVX2 support.
INFO: Successfully loaded faiss with AVX2 support.
sys:1: UserWarning: 
'infer.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'infer.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results/inference/status.json
  rank_zero_warn(
Inference results will be saved at: /results/inference
Experiment configuration:
encryption_key: '****'
results_dir: /results/inference
wandb:
  enable: true
  project: TAO Toolkit
  entity: ''
  tags:
  - training
  - tao-toolkit
  reinit: false
  sync_tensorboard: false
  save_code: false
  name: TAO Toolkit training experiment
train:
  num_gpus: 1
  gpu_ids:
  - 0
  num_nodes: 1
  seed: 1234
  cudnn:
    benchmark: false
    deterministic: true
  num_epochs: 10
  checkpoint_interval: 1
  validation_interval: 1
  resume_training_checkpoint_path: null
  results_dir: null
  optim:
    name: Adam
    steps:
    - 40
    - 70
    gamma: 0.1
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    triplet_loss_margin: 0.3
    embedder:
      bias_lr_factor: 1.0
      base_lr: 0.00035
      momentum: 0.9
      weight_decay: 0.0005
      weight_decay_bias: 0.0005
    trunk:
      bias_lr_factor: 1.0
      base_lr: 0.00035
      momentum: 0.9
      weight_decay: 0.0005
      weight_decay_bias: 0.0005
    miner_function_margin: 0.1
  clip_grad_norm: 0.0
  report_accuracy_per_class: true
  smooth_loss: true
  batch_size: 4
  val_batch_size: 4
  train_trunk: true
  train_embedder: true
model:
  backbone: nvdinov2_vit_large_legacy
  pretrained_model_path: null
  pretrained_trunk_path: null
  pretrained_embedder_path: null
  input_width: 224
  input_height: 224
  input_channels: 3
  feat_dim: 1024
evaluate:
  num_gpus: 1
  gpu_ids:
  - 0
  num_nodes: 1
  checkpoint: ???
  results_dir: null
  trt_engine: null
  topk: 1
  batch_size: 4
  report_accuracy_per_class: true
dataset:
  train_dataset: null
  val_dataset:
    reference: /data/retail-product-checkout-dataset_classification_demo/known_classes/reference
    query: ''
  workers: 8
  class_map: null
  pixel_mean:
  - 0.485
  - 0.456
  - 0.406
  pixel_std:
  - 0.226
  - 0.226
  - 0.226
  prob: 0.5
  re_prob: 0.5
  gaussian_blur:
    enabled: true
    kernel:
    - 15
    - 15
    sigma:
    - 0.3
    - 0.7
  color_augmentation:
    enabled: true
    brightness: 0.5
    contrast: 0.3
    saturation: 0.1
    hue: 0.1
  random_rotation: false
  num_instance: 4
export:
  batch_size: -1
  checkpoint: null
  gpu_id: 0
  onnx_file: null
  on_cpu: false
  opset_version: 14
  verbose: true
  results_dir: null
gen_trt_engine:
  results_dir: null
  gpu_id: 0
  onnx_file: ???
  trt_engine: null
  batch_size: -1
  verbose: true
  tensorrt:
    data_type: FP32
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1
    calibration:
      cal_cache_file: null
      cal_batch_size: 1
      cal_batches: 1
      cal_image_dir: []
inference:
  num_gpus: 1
  gpu_ids:
  - 0
  num_nodes: 1
  checkpoint: /model/retail_object_recognition.pth
  results_dir: /results/inference
  trt_engine: null
  input_path: /data/retail-product-checkout-dataset_classification_demo/known_classes/test
  inference_input_type: classification_folder
  batch_size: 16
  topk: 1

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Error executing job with overrides: ['results_dir=/results', 'inference.checkpoint=/model/retail_object_recognition.pth', 'dataset.val_dataset.reference=/data/retail-product-checkout-dataset_classification_demo/known_classes/reference', 'inference.input_path=/data/retail-product-checkout-dataset_classification_demo/known_classes/test']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/metric_learning_recognition/scripts/inference.py", line 76, in main
    run_experiment(experiment_config=cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/metric_learning_recognition/scripts/inference.py", line 42, in run_experiment
    metric_learning_recognition = MLRecogModel.load_from_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/model_helpers.py", line 125, in wrapper
    return self.method(cls, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1581, in load_from_checkpoint
    loaded = _load_from_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/saving.py", line 66, in _load_from_checkpoint
    checkpoint = _pl_migrate_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/migration/utils.py", line 143, in _pl_migrate_checkpoint
    old_version = _get_version(checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/migration/utils.py", line 164, in _get_version
    return checkpoint["pytorch-lightning_version"]
KeyError: 'pytorch-lightning_version'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2025-02-19 10:50:44,088 - TAO Toolkit - root - INFO] Sending telemetry data.
[2025-02-19 10:50:44,088 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-02-19 10:50:44,088 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'inference', 'network': 'metric_learning_recognition', 'gpu': ['Tesla-T4'], 'success': False, 'time_lapsed': 14} to https://api.tao.ngc.nvidia.com.
[2025-02-19 10:50:44,734 - TAO Toolkit - root - INFO] Failed with reponse: error 502
[2025-02-19 10:50:44,734 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-02-19 10:50:44,735 - TAO Toolkit - root - WARNING] Execution status: FAIL

2025-02-19 10:50:46,822 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

We will check further and get back to you.

The retail_object_recognition_head_fan_base_v2.0.pth file contains only the PyTorch weights for the embedder module, so it should be loaded through model.pretrained_embedder_path.
The inference.checkpoint, by contrast, should be a checkpoint produced by training, which you can obtain by training for at least one epoch.

Here’s the suggested config for your case. Please download the weights for the trunk with:

!ngc registry model download-version nvidia/tao/pretrained_fan_classification_nvimagenet:fan_base_hybrid_nvimagenet --dest $HOST_MODEL_DIR/

Then please try the spec file below:

results_dir: "???"
model:
  backbone: fan_base
  pretrained_model_path: null
  pretrained_trunk_path: /model/pretrained_fan_classification_nvimagenet_vfan_base_hybrid_nvimagenet/fan_base_hybrid_nvimagenet.pth
  pretrained_embedder_path: /model/retail_object_recognition_vtrainable_head_fan_base_v2.0/retail_object_recognition_head_fan_base_v2.0.pth
  input_width: 224
  input_height: 224 
  feat_dim: 1024
train:
  train_trunk: False
  train_embedder: True
  optim:
    name: Adam
    steps: [40, 70]
    gamma: 0.1
    embedder:
      bias_lr_factor: 1
      weight_decay: 0.001
      weight_decay_bias: 0.0005
      base_lr: 1e-3
      momentum: 0.9
    trunk:
      bias_lr_factor: 1
      weight_decay: 0.0001
      weight_decay_bias: 0.0005
      base_lr: 1e-3
      momentum: 0.9
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    triplet_loss_margin: 0.3
    miner_function_margin: 0.1
  num_epochs: 1
  resume_training_checkpoint_path: null
  checkpoint_interval: 1
  smooth_loss: False
  batch_size: 16
  val_batch_size: 16
dataset:
  train_dataset: "???"
  val_dataset:
    reference: "???"
    query: "???"
  workers: 12
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  prob: 0.5
  re_prob: 0.5
  num_instance: 4
  color_augmentation: 
    enabled: True
  gaussian_blur:
    enabled: True
inference:
  inference_input_type: classification_folder
  input_path: "???"
  batch_size: 16
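
The two steps described above (train for one epoch, then point inference.checkpoint at the result) could be sketched as follows. This only builds and prints the command lines; `tao_command` is a hypothetical helper, the spec file name `train.yaml` is an assumption, and every value in angle brackets is a placeholder to replace with your own path:

```python
def tao_command(action, spec_path, overrides):
    """Build a `tao model ml_recog <action>` command line as an argument list."""
    args = ["tao", "model", "ml_recog", action, "-e", spec_path]
    # Overrides are passed as key=value tokens after the spec file.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


spec = "$SPECS_DIR/train.yaml"  # assumed file name for the spec above

# Step 1: one epoch of training to produce a full checkpoint.
train = tao_command("train", spec, {
    "results_dir": "$RESULTS_DIR",
    "dataset.train_dataset": "<train folder>",
    "dataset.val_dataset.reference": "<reference folder>",
    "dataset.val_dataset.query": "<query folder>",
})

# Step 2: inference with the checkpoint written by step 1.
infer = tao_command("inference", spec, {
    "results_dir": "$RESULTS_DIR",
    "inference.checkpoint": "<checkpoint written by the training run>",
    "dataset.val_dataset.reference": "<reference folder>",
    "inference.input_path": "<test folder>",
})

for command in (train, infer):
    print(" ".join(command))
```

With train_trunk set to False and num_epochs set to 1 in the spec, the training step is short; its main purpose here is to produce a checkpoint in the format the inference script loads.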