Thanks for the answer.
I tried downloading that model and using it, but without any luck:
# run inference on known classes
! tao model ml_recog inference \
-e $SPECS_DIR/infer.yaml \
results_dir=$RESULTS_DIR \
inference.checkpoint=$MODEL_DIR/retail_object_recognition.pth \
dataset.val_dataset.reference=$DATA_DIR/$DATA_FOLDER/known_classes/reference \
inference.input_path=$DATA_DIR/$DATA_FOLDER/known_classes/test
This is my output log. I've tried adding pytorch-lightning_version and so on, but state_dict is still missing.
2025-02-19 10:50:17,230 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-02-19 10:50:17,318 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-02-19 10:50:17,334 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/gmdo/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-02-19 10:50:17,334 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-02-19 10:50:24,921 - TAO Toolkit - faiss.loader - INFO] Loading faiss with AVX2 support.
[2025-02-19 10:50:24,942 - TAO Toolkit - faiss.loader - INFO] Successfully loaded faiss with AVX2 support.
[2025-02-19 10:50:25,813 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
INFO: Loading faiss with AVX2 support.
INFO: Successfully loaded faiss with AVX2 support.
sys:1: UserWarning:
'infer.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
'infer.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
_run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results/inference/status.json
rank_zero_warn(
Inference results will be saved at: /results/inference
Experiment configuration:
encryption_key: '****'
results_dir: /results/inference
wandb:
enable: true
project: TAO Toolkit
entity: ''
tags:
- training
- tao-toolkit
reinit: false
sync_tensorboard: false
save_code: false
name: TAO Toolkit training experiment
train:
num_gpus: 1
gpu_ids:
- 0
num_nodes: 1
seed: 1234
cudnn:
benchmark: false
deterministic: true
num_epochs: 10
checkpoint_interval: 1
validation_interval: 1
resume_training_checkpoint_path: null
results_dir: null
optim:
name: Adam
steps:
- 40
- 70
gamma: 0.1
warmup_factor: 0.01
warmup_iters: 10
warmup_method: linear
triplet_loss_margin: 0.3
embedder:
bias_lr_factor: 1.0
base_lr: 0.00035
momentum: 0.9
weight_decay: 0.0005
weight_decay_bias: 0.0005
trunk:
bias_lr_factor: 1.0
base_lr: 0.00035
momentum: 0.9
weight_decay: 0.0005
weight_decay_bias: 0.0005
miner_function_margin: 0.1
clip_grad_norm: 0.0
report_accuracy_per_class: true
smooth_loss: true
batch_size: 4
val_batch_size: 4
train_trunk: true
train_embedder: true
model:
backbone: nvdinov2_vit_large_legacy
pretrained_model_path: null
pretrained_trunk_path: null
pretrained_embedder_path: null
input_width: 224
input_height: 224
input_channels: 3
feat_dim: 1024
evaluate:
num_gpus: 1
gpu_ids:
- 0
num_nodes: 1
checkpoint: ???
results_dir: null
trt_engine: null
topk: 1
batch_size: 4
report_accuracy_per_class: true
dataset:
train_dataset: null
val_dataset:
reference: /data/retail-product-checkout-dataset_classification_demo/known_classes/reference
query: ''
workers: 8
class_map: null
pixel_mean:
- 0.485
- 0.456
- 0.406
pixel_std:
- 0.226
- 0.226
- 0.226
prob: 0.5
re_prob: 0.5
gaussian_blur:
enabled: true
kernel:
- 15
- 15
sigma:
- 0.3
- 0.7
color_augmentation:
enabled: true
brightness: 0.5
contrast: 0.3
saturation: 0.1
hue: 0.1
random_rotation: false
num_instance: 4
export:
batch_size: -1
checkpoint: null
gpu_id: 0
onnx_file: null
on_cpu: false
opset_version: 14
verbose: true
results_dir: null
gen_trt_engine:
results_dir: null
gpu_id: 0
onnx_file: ???
trt_engine: null
batch_size: -1
verbose: true
tensorrt:
data_type: FP32
workspace_size: 1024
min_batch_size: 1
opt_batch_size: 1
max_batch_size: 1
calibration:
cal_cache_file: null
cal_batch_size: 1
cal_batches: 1
cal_image_dir: []
inference:
num_gpus: 1
gpu_ids:
- 0
num_nodes: 1
checkpoint: /model/retail_object_recognition.pth
results_dir: /results/inference
trt_engine: null
input_path: /data/retail-product-checkout-dataset_classification_demo/known_classes/test
inference_input_type: classification_folder
batch_size: 16
topk: 1
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Error executing job with overrides: ['results_dir=/results', 'inference.checkpoint=/model/retail_object_recognition.pth', 'dataset.val_dataset.reference=/data/retail-product-checkout-dataset_classification_demo/known_classes/reference', 'inference.input_path=/data/retail-product-checkout-dataset_classification_demo/known_classes/test']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
raise e
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/metric_learning_recognition/scripts/inference.py", line 76, in main
run_experiment(experiment_config=cfg)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/metric_learning_recognition/scripts/inference.py", line 42, in run_experiment
metric_learning_recognition = MLRecogModel.load_from_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/model_helpers.py", line 125, in wrapper
return self.method(cls, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1581, in load_from_checkpoint
loaded = _load_from_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/saving.py", line 66, in _load_from_checkpoint
checkpoint = _pl_migrate_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/migration/utils.py", line 143, in _pl_migrate_checkpoint
old_version = _get_version(checkpoint)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/migration/utils.py", line 164, in _get_version
return checkpoint["pytorch-lightning_version"]
KeyError: 'pytorch-lightning_version'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2025-02-19 10:50:44,088 - TAO Toolkit - root - INFO] Sending telemetry data.
[2025-02-19 10:50:44,088 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-02-19 10:50:44,088 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'inference', 'network': 'metric_learning_recognition', 'gpu': ['Tesla-T4'], 'success': False, 'time_lapsed': 14} to https://api.tao.ngc.nvidia.com.
[2025-02-19 10:50:44,734 - TAO Toolkit - root - INFO] Failed with reponse: error 502
[2025-02-19 10:50:44,734 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-02-19 10:50:44,735 - TAO Toolkit - root - WARNING] Execution status: FAIL
2025-02-19 10:50:46,822 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
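
For reference, here is a minimal sketch of how the top-level keys of the checkpoint could be checked before handing it to TAO. It only looks for the two keys that come up in this thread (`pytorch-lightning_version` from the traceback, and `state_dict`). The helper names `missing_keys` and `load_checkpoint` are hypothetical, not part of TAO or Lightning, and the sketch assumes the `.pth` file is a plain pickle that `torch.load` can read, which may not hold if the file is encrypted or stored in another format.

```python
from typing import Any, Mapping

# Top-level keys discussed in this thread: the traceback fails on the
# first one, and the second is reported as still missing after patching.
EXPECTED_KEYS = ("pytorch-lightning_version", "state_dict")


def missing_keys(checkpoint: Mapping[str, Any]) -> list[str]:
    """Return the expected top-level keys that the checkpoint lacks."""
    return [key for key in EXPECTED_KEYS if key not in checkpoint]


def load_checkpoint(path: str) -> Any:
    """Load a checkpoint file on the CPU.

    Assumption: the file is a regular torch.save() pickle. PyTorch is
    imported lazily so missing_keys() can be used without it installed.
    """
    import torch

    return torch.load(path, map_location="cpu")


# Example with an in-memory dict standing in for a loaded checkpoint
# that has been patched with a version string but has no weights entry.
patched = {"pytorch-lightning_version": "placeholder"}
print(missing_keys(patched))
```

With the real file, `missing_keys(load_checkpoint(path))` would show which of the two keys are absent, provided the loaded object is a dict. If `state_dict` is missing, printing the dict's own keys should show what the file actually contains.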