How to download ResNet pre-trained weights for Classification_pyt

• Hardware: RTX 4090, Ubuntu 22.04, NVIDIA driver 535.183.01, TAO Toolkit 6.25.9
• Network Type: Classification_pyt

I found in your documentation that Classification_pyt supports ResNet-series backbones, but I couldn’t find where to download them. I searched on GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC for “Classification” and found three pre-trained weights, but ResNet was not among them. I have run into the same problem with other models: some backbones can be found on NGC while others cannot.

I downloaded the ResNet18 weights for Classification_tf1, but I can’t use them: they are based on TensorFlow, while Classification_pyt is based on PyTorch.

@Morganh
Hello, could you please take a look?

For the ResNet variants, there are no pretrained models available that are dedicated to the classification_pyt network. One suggestion is to prepare the ImageNet dataset and train with the classification_pyt network. Some tips can be found in Preparing State-of-the-Art Models for Classification and Object Detection with NVIDIA TAO Toolkit | NVIDIA Technical Blog.

That is to say, Classification_pyt supports the ResNet variants as backbones, but currently no pre-trained model is available for them. Can I use the ImageNet dataset to train a pre-trained model myself?

Classification_pyt supports the ResNet variants and also several other kinds of backbones. Refer to Image Classification PyT — TAO Toolkit. Some variants have pretrained models, for example the FAN variants (GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC) and the GCViT variants (GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC).

Due to copyright issues, we can’t provide the ImageNet dataset or any ImageNet-pretrained models in TAO Toolkit. Users can follow the blog to prepare the dataset and train a pretrained model themselves.

I trained a model with the ResNet18 backbone on the ImageNet_100 dataset, but when I used it as the pre-trained model to train on my own data, I got an error:

RuntimeError: Error(s) in loading state_dict for ResNet:
size mismatch for fc.weight: copying a param with shape torch.Size([100, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
size mismatch for fc.bias: copying a param with shape torch.Size([100]) from checkpoint, the shape in current model is torch.Size([2]).

This is a class-count mismatch between the pre-trained model and my dataset. How can I solve it? When training stops, the container is deleted, so I cannot modify the code to load the model in non-strict mode.

You can log in to the docker container to modify the code. Use the command below to launch the TAO PyTorch docker:
$ docker run --runtime=nvidia --rm -it nvcr.io/nvidia/tao/tao-toolkit:6.0.0-pyt /bin/bash

The source code can then be modified inside the container. It is under /usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/.

Thank you. I created a new container and modified the code to remove the FC-layer weights (the classification head) during model loading, keeping only the backbone. I trained and exported the model in the container without any issues. But why doesn’t your official model have this problem, while the pre-trained model I trained myself does? Can you fix this in future versions?
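
For reference, the change I made is roughly equivalent to the sketch below, which uses a plain torchvision ResNet18 as a stand-in for the TAO model (the real TAO loading code and key names differ):

import torchvision

ckpt_model = torchvision.models.resnet18(num_classes=100)  # stands in for my ImageNet_100 checkpoint
new_model = torchvision.models.resnet18(num_classes=2)     # stands in for the 2-class model

# Keep only backbone weights; drop the classification head ('fc.weight', 'fc.bias')
backbone_only = {k: v for k, v in ckpt_model.state_dict().items() if not k.startswith('fc.')}

# strict=False tolerates the now-missing head keys instead of raising
result = new_model.load_state_dict(backbone_only, strict=False)
print(result.missing_keys)  # ['fc.weight', 'fc.bias']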

Do you mean there is no issue when using an official model? Which official model did you use?

Yes, I have tested several variants of FAN, GCViT, and FasterViT without any issues

I cannot reproduce the problem. I think you may be using nvcr.io/nvidia/tao/tao-toolkit:6.25.9-pyt, because it supports resnet_18. The nvcr.io/nvidia/tao/tao-toolkit:6.0.0-pyt image does not support resnet_18.

My steps:

  1. Log in to nvcr.io/nvidia/tao/tao-toolkit:6.25.9-pyt and train with the cat_and_dog dataset mentioned in the notebook.
  2. Do not set pretrained_backbone_path
  3. After several epochs, get the trained result: ./results_train/train/classifier_model_latest.pth
  4. Then set the pretrained_backbone_path: ./results_train/train/classifier_model_latest.pth
  5. The training can run smoothly without any issue.

Please double check and share more info if you have it.
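
If it helps, you can inspect the classification-head shapes stored in that checkpoint with a short script like the one below (a sketch; the exact key names can vary by TAO version):

import torch

# Load the TAO training checkpoint on CPU (trusted file, so weights_only=False is acceptable)
ckpt = torch.load('./results_train/train/classifier_model_latest.pth', map_location='cpu', weights_only=False)

# Print the head parameter shapes; the output dimension equals the number of classes it was trained with
for name, tensor in ckpt['state_dict'].items():
    if 'head' in name:
        print(name, tuple(tensor.shape))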

Yes, I am using nvcr.io/nvidia/tao/tao-toolkit:6.25.9-pyt. The problem is that the dataset used to train the pre-trained model has 100 classes; using it as the pre-trained model to train a 2-class dataset results in an error:
RuntimeError: Error(s) in loading state_dict for ResNet:
size mismatch for fc.weight: copying a param with shape torch.Size([100, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
size mismatch for fc.bias: copying a param with shape torch.Size([100]) from checkpoint, the shape in current model is torch.Size([2]).
In your test, the pre-trained model and the dataset you are about to train on have the same number of classes, so this error is not reported.

Yes, I can reproduce the error now when running the steps below.

  1. Comment out the pretrained_backbone_path line. Set num_classes: 2. Train on the 2-class (cats and dogs) dataset mentioned in the notebook. After 1 or 2 epochs, get the latest result ./results_train/train/classifier_model_latest.pth.
  2. Set pretrained_backbone_path: ./results_train/train/classifier_model_latest.pth. Set num_classes: 3.
    Also change the dataset to 3 classes. I generated a dummy “owl” class by copying from the “cat” dataset.

The workaround is to generate a new backbone_only.pth based on the classifier_model_latest.pth.

import torch

# Load the checkpoint (disable weights_only, only do this if you trust the file!)
ckpt = torch.load('classifier_model_latest.pth', map_location='cpu', weights_only=False)

# Extract the model's state_dict
state_dict = ckpt['state_dict']

# Keep only backbone parameters (exclude parameters starting with 'model.head.')
backbone_state_dict = {k: v for k, v in state_dict.items() if not k.startswith('model.head.')}

# Save the filtered state_dict as a new .pth file
torch.save(backbone_state_dict, 'backbone_only.pth')

print('Saved backbone_only.pth, contains only backbone parameters. Count:', len(backbone_state_dict))
print('Example parameter names:', list(backbone_state_dict.keys())[:1000])
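
After generating the file, a quick sanity check (a sketch) confirms that no head parameters remain before pointing pretrained_backbone_path at it:

import torch

# Reload the filtered file and confirm the classification head was removed
backbone = torch.load('backbone_only.pth', map_location='cpu')
assert not any(k.startswith('model.head.') for k in backbone), 'head parameters are still present'
print('OK:', len(backbone), 'backbone tensors kept')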

The reason why you cannot see the issue with the official models is that 'head.fc.weight' and 'head.fc.bias' are skipped during loading; see tao_pytorch_backend/nvidia_tao_pytorch/cv/classification_pyt/model/classifier.py at main · NVIDIA/tao_pytorch_backend · GitHub.

You can find the log below.

_IncompatibleKeys(missing_keys=['head.weight', 'head.bias'], unexpected_keys=['head.fc.weight', 'head.fc.bias']) (classifier.py:69)
GPU available: True (cuda), used: True

In other words, TAO’s official model works because the classification head parameter names do not match the new model, so they are skipped. Your own model fails because the parameter names match, but the shapes differ, triggering a shape mismatch error.
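
The same behavior can be reproduced with a plain PyTorch module (an illustrative sketch, not the TAO code itself): names that do not exist in the model are only reported when strict=False, while names that match but have different shapes still raise the size-mismatch error.

import torch
from torch import nn

model = nn.Linear(512, 2)

# Unknown parameter names are reported but skipped when strict=False
result = model.load_state_dict({'head.fc.weight': torch.zeros(100, 512)}, strict=False)
print(result)  # missing_keys=['weight', 'bias'], unexpected_keys=['head.fc.weight']

# Matching names with different shapes raise RuntimeError even with strict=False
try:
    model.load_state_dict({'weight': torch.zeros(100, 512), 'bias': torch.zeros(100)}, strict=False)
except RuntimeError as e:
    print(e)  # size mismatch for weight and bias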

BTW, you can check the parameter names in either the official model or your own model via the script below.

import torch
state_dict_official = torch.load('fan_hybrid_small.pth', map_location='cpu', weights_only=False)
print(state_dict_official.keys())
print(state_dict_official['state_dict'].keys())


Okay, thank you. My problem has been resolved.
