I saw in your documentation that the Classifier_pyt model supports ResNet-series backbones, but I couldn’t find where to download them. On NGC (GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC), searching for Classification, I found three pretrained weights, but none of them is a ResNet. I have run into the same problem with other models: some backbones can be found by searching, while others cannot.
I downloaded resnet18 for Classifier_tf1, but I can’t use it: it is a TensorFlow model, while Classifier_pyt is based on PyTorch.
In other words, Classifier_pyt supports ResNet variants as the backbone, but no pretrained model is currently available. Can users train a pretrained model themselves on the ImageNet dataset?
Due to copyright issues, we can’t provide the ImageNet dataset or any ImageNet-pretrained models in TAO Toolkit. Users can follow the blog to prepare the dataset and train a pretrained model.
I trained a model with the ResNet18 backbone on the ImageNet_100 dataset, but when I used it as the pretrained model for training on my own data, I got an error:
RuntimeError: Error(s) in loading state_dict for ResNet:
size mismatch for fc.bias: copying a param with shape torch.Size([100]) from checkpoint, the shape in current model is torch.Size([2]).
This is a mismatch between the number of classes in the pretrained model and in my dataset. How can I solve it? When training stops, the container is deleted, so I cannot modify the code to load the model in non-strict mode.
You can log in to the docker container to modify the code. Use the command below to start the TAO PyTorch docker.
$ docker run --runtime=nvidia --rm -it nvcr.io/nvidia/tao/tao-toolkit:6.0.0-pyt /bin/bash
Then the source code can be modified as well. The code is under /usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/ .
Thank you. I created a new container and modified the code to drop the FC layer weights (the classification head) during model loading, keeping only the backbone. I trained and exported the model in the container without any issues. But why doesn’t your official model have this problem, while the pretrained model I trained myself does? Can you fix this in a future version?
Yes, I am using nvcr.io/nvidia/tao/tao-toolkit:6.25.9-pyt. The problem is that if the pretrained model was trained on a 100-class dataset, using it as the pretrained model to train on a 2-class dataset results in an error:
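The head-dropping approach described above could look roughly like the following. This is a minimal sketch, not the actual TAO loading code: `TinyClassifier` and `load_matching_weights` are hypothetical names, and the tiny model only stands in for a ResNet-style classifier with an `fc` head. Note that in PyTorch, `load_state_dict(strict=False)` alone still raises on a shape mismatch, so the mismatched tensors are filtered out first.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a ResNet-style classifier: backbone plus `fc` head."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = nn.Linear(8, 512)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        return self.fc(torch.relu(self.backbone(x)))


def load_matching_weights(model: nn.Module, checkpoint_sd: dict) -> list:
    """Load only tensors whose name and shape match the model; return skipped names."""
    model_sd = model.state_dict()
    compatible = {
        name: tensor
        for name, tensor in checkpoint_sd.items()
        if name in model_sd and tensor.shape == model_sd[name].shape
    }
    # strict=False tolerates the keys dropped above (they count as "missing"),
    # but it would not tolerate a shape mismatch, hence the filter.
    model.load_state_dict(compatible, strict=False)
    return sorted(set(checkpoint_sd) - set(compatible))


pretrained = TinyClassifier(num_classes=100)  # plays the role of the 100-class checkpoint
finetune = TinyClassifier(num_classes=2)      # plays the role of the new 2-class model
skipped = load_matching_weights(finetune, pretrained.state_dict())
print("Skipped (left at fresh initialization):", skipped)
```

Only the head tensors are skipped here; the backbone weights are copied from the 100-class model, and the new 2-class head keeps its random initialization.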
RuntimeError: Error(s) in loading state_dict for ResNet:
size mismatch for fc.bias: copying a param with shape torch.Size([100]) from checkpoint, the shape in current model is torch.Size([2]).
Your pretrained model and the data you are training on have the same number of classes, so this error is not reported.
Yes, I can reproduce the error now with the steps below.
Comment out the pretrained_backbone_path line. Set num_classes: 2. Train on the 2-class (cats and dogs) dataset mentioned in the notebook. After 1 or 2 epochs, take the latest result, ./results_train/train/classifier_model_latest.pth.
Set pretrained_backbone_path: ./results_train/train/classifier_model_latest.pth. Set num_classes: 3.
Also change the dataset to 3 classes. I generated a dummy “owl” class by copying the “cat” data.
The workaround is to generate a new backbone_only.pth from classifier_model_latest.pth.
import torch
# Load the checkpoint (disable weights_only, only do this if you trust the file!)
ckpt = torch.load('classifier_model_latest.pth', map_location='cpu', weights_only=False)
# Extract the model's state_dict
state_dict = ckpt['state_dict']
# Keep only backbone parameters (exclude parameters starting with 'model.head.')
backbone_state_dict = {k: v for k, v in state_dict.items() if not k.startswith('model.head.')}
# Save the filtered state_dict as a new .pth file
torch.save(backbone_state_dict, 'backbone_only.pth')
print('Saved backbone_only.pth, contains only backbone parameters. Count:', len(backbone_state_dict))
print('Example parameter names:', list(backbone_state_dict.keys())[:1000])
In other words, TAO’s official model works because the classification head parameter names do not match the new model, so they are skipped. Your own model fails because the parameter names match, but the shapes differ, triggering a shape mismatch error.
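To make the distinction above concrete, here is a small sketch that sorts checkpoint entries into the three cases by comparing names and shapes only. It assumes the loader ignores names the model does not know, as the explanation above describes. The function name and all parameter names and shapes in the example are hypothetical and are not taken from a real TAO checkpoint.

```python
def classify_checkpoint_keys(checkpoint_shapes, model_shapes):
    """Split checkpoint entries into loaded, skipped, and mismatched names.

    Both arguments map a parameter name to its shape (a tuple of ints).
    """
    loaded, skipped, mismatched = [], [], []
    for name, shape in checkpoint_shapes.items():
        if name not in model_shapes:
            skipped.append(name)      # unknown name: nothing to copy into
        elif tuple(shape) != tuple(model_shapes[name]):
            mismatched.append(name)   # same name, different shape: load error
        else:
            loaded.append(name)       # same name and shape: copied
    return loaded, skipped, mismatched


# Hypothetical 2-class model.
model_shapes = {"conv1.weight": (64, 3, 7, 7), "fc.weight": (2, 512), "fc.bias": (2,)}

# Hypothetical checkpoint whose head uses different names: the head is skipped.
official_like = {"conv1.weight": (64, 3, 7, 7), "head.fc.weight": (1000, 512), "head.fc.bias": (1000,)}

# Hypothetical self-trained 100-class checkpoint: names match, shapes differ.
self_trained = {"conv1.weight": (64, 3, 7, 7), "fc.weight": (100, 512), "fc.bias": (100,)}

official_result = classify_checkpoint_keys(official_like, model_shapes)
own_result = classify_checkpoint_keys(self_trained, model_shapes)
print("official-like:", official_result)
print("self-trained: ", own_result)
```

For a real checkpoint, the shape mapping can be built with `{k: tuple(v.shape) for k, v in state_dict.items()}`.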
BTW, you can check the parameter names of both the official model and your own model with the script below.