Could you please share the training spec file you used to get the above intermediate result? It looks promising. You can try training with more epochs/iterations. More experiments, such as changing the backbone, can also be tried.
Morgan, I’ve been able to train the same model on the same dataset with correct results on another platform using the same hyperparameters. For various reasons, I want to train this model on TAO so that I can export it in TensorRT or ONNX format for inference on Triton.
There is a fundamental problem here: the model is being trained on, and predicting, the background rather than the main object, as you can see in the predicted mask. Also, the results look more like semantic segmentation than instance segmentation. Despite this, when I convert the PyTorch model to TensorRT and run inference with the converted model, I do get masks, just not with good performance. These results are with 200 epochs and hyperparameters fine-tuned with ClearML HPO using the Optuna search method.
I expect accurate results like this:
It’s important to note that my mIoU (0.47) and accuracy (0.94) still stay flat throughout training while the train/val losses go down.
Here is my training spec for your reference as requested.
inst_spec.txt (2.5 KB)
May I know if you get a similar issue when running mask2former_inst.ipynb (tao_tutorials/notebooks/tao_launcher_starter_kit/mask2former/mask2former_inst.ipynb at main · NVIDIA/tao_tutorials · GitHub)?
I will try to reproduce it first.
Currently, the possible culprits are:
- Is the dataset large enough? You have only 328 training images.
- Can we set a larger backbone, for example, Swin-L?
- Does category_id start from 0 or 1? (A quick way to check is sketched below.)
- Is it related to the image resolution, since yours are 2450x500? How about cropping to square images?
Also, can you run evaluation against the training dataset as well?
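For the category_id question, a quick sanity check could look like the sketch below. This is just a minimal example assuming a standard COCO-format instance annotation file; the path is a placeholder you would replace with your own train.json.

import json

# Placeholder path -- point this at your own COCO train.json
ann_path = "Mask2former_data_COCO/annotations/train.json"

with open(ann_path) as f:
    coco = json.load(f)

cat_ids = sorted(c["id"] for c in coco["categories"])
used_ids = sorted({a["category_id"] for a in coco["annotations"]})

print("images:", len(coco["images"]))
print("annotations:", len(coco["annotations"]))
print("category ids defined in 'categories':", cat_ids)   # expected to start from 1
print("category ids used in 'annotations':", used_ids)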
Hello Morgan,
- To answer your questions 1 and 2, I’ve already trained this model with successful results on another platform (not TAO) using the same model architecture and the same dataset, so it can’t be either of those.
- My category_id starts from 1, as noted in the documentation.
- I’ve tried cropping the images to squares, and it makes no real difference to the overall performance (it only changes it slightly).
- I’ve run evaluation against the training set with similar results (i.e., still no prediction masks from the model).
As we synced offline, please run 500 epochs against your 2450x500 dataset. It will take about 3.4 hours. During training, please ignore the info “mIoU=1.000, all_acc=1.000”.
I can get a promising result on my side.
I did not change the format of the dataset; I trained using the Mask2former_data_COCO folder you shared.
After training, you need to run inference with the change mentioned in https://forums.developer.nvidia.com/t/tao-model-mask2former-inference-does-not-produce-overlay-images-or-masks-annotations/338399/9?u=morganh since your dataset is in .png format.
Below is the spec file.
$ cat 20250820_mask2former.yaml
results_dir: ./mask2former_inst/
dataset:
  contiguous_id: false #True
  label_map: /localhome/local-morganh/Mask2former_data_COCO/annotations/label_inst.json
  train:
    type: 'coco'
    name: "my_train"
    instance_json: "/localhome/local-morganh/Mask2former_data_COCO/annotations/train.json"
    img_dir: "/localhome/local-morganh/Mask2former_data_COCO/train"
    batch_size: 2 #16
    num_workers: 2
    target_size: [2450, 500]
    #target_size: [672, 672]
  val:
    type: 'coco'
    name: "my_val"
    instance_json: "/localhome/local-morganh/Mask2former_data_COCO/annotations/val.json"
    img_dir: "/localhome/local-morganh/Mask2former_data_COCO/val"
    batch_size: 1
    num_workers: 2
    target_size: [2450, 500]
    #target_size: [672, 672]
  test:
    img_dir: "/localhome/local-morganh/Mask2former_data_COCO/test"
    batch_size: 1
    num_workers: 2
    type: 'coco'
  augmentation:
    train_min_size: [500] #[640]
    train_max_size: 2450
    #train_crop_size: [512, 512] #[640, 640]
    train_crop_size: [500, 2450] #[640, 640]
    #train_crop_size: [672, 672] #[640, 640]
    test_min_size: 500 #640
    test_max_size: 2450
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.229, 0.224, 0.225]
train:
  #precision: 'fp16'
  precision: 'fp32'
  num_gpus: 1
  checkpoint_interval: 1
  validation_interval: 1
  num_epochs: 500 #200 #50
  clip_grad_norm: 0.4
  optim:
    lr_scheduler: "MultiStep"
    #milestones: [44, 48]
    #milestones: [120, 150]
    milestones: [350, 400]
    type: "AdamW"
    lr: 0.0003
    weight_decay: 0.06
    gamma: 0.1
evaluate:
  checkpoint: ./mask2former_inst/train/mask2former_model_latest.pth
  num_gpus: 1
  results_dir: ./mask2former_inst/evaluate
inference:
  checkpoint: ./mask2former_inst/train/mask2former_model_latest.pth
  num_gpus: 1
  gpu_ids: [0]
  results_dir: ./mask2former_inst/inference_test
model:
  object_mask_threshold: 0.01 #0.1
  overlap_threshold: 0.01 #0.8
  mode: "instance"
  backbone:
    #pretrained_weights: null
    pretrained_weights: "/localhome/local-morganh/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 2 #2
export:
  input_channel: 3
  input_width: 640
  input_height: 640
  opset_version: 17
  batch_size: -1 # dynamic batch size
  on_cpu: False
gen_trt_engine:
  gpu_id: 0
  tensorrt:
    data_type: fp16
    workspace_size: 4096
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1
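As a side note for the ONNX/Triton goal: after running the export step with the export section above (3x640x640 input, dynamic batch), you can quickly sanity-check the resulting .onnx file with onnxruntime. This is only a minimal sketch; the file path is an assumption, so adjust it to whatever your export actually produced.

import onnxruntime as ort

# Placeholder path -- use the .onnx file produced by your export step
sess = ort.InferenceSession("mask2former_inst/export/model.onnx",
                            providers=["CPUExecutionProvider"])

# Print input/output names and shapes. The batch dimension should appear as a
# symbolic (dynamic) dimension, matching batch_size: -1 in the export section,
# and the spatial size should match the 640x640 export settings.
for i in sess.get_inputs():
    print("input :", i.name, i.shape, i.type)
for o in sess.get_outputs():
    print("output:", o.name, o.shape, o.type)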