SegFormer fine-tuning

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): NVIDIA RTX PRO 4000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): SegFormer
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I want to fine-tune the SegFormer network on my own data (two classes: tumor and background). How can I get the pre-trained model and its spec file?

Note: In the tao_tutorials/notebooks/tao_launcher_starter_kit/segformer/segformer.ipynb at main · NVIDIA/tao_tutorials · GitHub

there is no instruction for downloading the pre-trained model.

I really appreciate your help.

Hi @eduardo.assuncao1 ,
SegFormer supports several kinds of backbones according to SegFormer — TAO Toolkit.

Please refer to
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/pretrained_segformer_imagenet/ .
For example, the pretrained model for fan_base is at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/pretrained_segformer_imagenet/files?version=fan_hybrid_base_in22k_1k_384.

A similar topic is: Training SegFormer with Nv-DinoV2 backbone on Segmentation Task - #2 by Morganh.

BTW, for nv_dino_v2 models, see for example:
https://catalog.ngc.nvidia.com/orgs/nvaie/models/nv_dinov2_classification_model/files
https://catalog.ngc.nvidia.com/orgs/nvaie/models/imagenet_nv_dinov2/files

@Morganh, thank you for your reply.

I have managed to train the SegFormer model. However, the performance is too low for the foreground class:

!tao model segformer evaluate \
    -e $SPECS_DIR/test_isbi.yaml \
    evaluate.checkpoint=$RESULTS_DIR/isbi_experiment/train/segformer_model_latest.pth \
    results_dir=$RESULTS_DIR/isbi_experiment

Testing DataLoader 0: 100%|██████████| 67/67 [00:04<00:00, 15.59it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ F1_0 │ 0.9995958805084229 │
│ F1_1 │ 0.532546877861023 │
│ acc │ 0.9991925358772278 │
│ iou_0 │ 0.9991921782493591 │
│ iou_1 │ 0.3629055917263031 │
│ mf1 │ 0.7660713791847229 │
│ miou │ 0.6810488700866699 │
│ mprecision │ 0.9157484769821167 │
│ mrecall │ 0.695730984210968 │
│ precision_0 │ 0.9992848634719849 │
│ precision_1 │ 0.8322120308876038 │
│ recall_0 │ 0.9999071359634399 │
│ recall_1 │ 0.3915548324584961 │
└───────────────────────────┴───────────────────────────┘
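For context, the per-class numbers in this table follow the standard pixel-count definitions. A minimal Python sketch (the counts below are hypothetical, chosen to land near the foreground row above — high precision with low recall means the model finds few tumor pixels, but the ones it finds are mostly right):

```python
def seg_metrics(tp, fp, fn):
    """Per-class IoU, precision, recall and F1 from pixel counts."""
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f1

# Hypothetical foreground counts: many missed pixels (fn), few false alarms (fp)
iou, prec, rec, f1 = seg_metrics(tp=400, fp=80, fn=620)
# iou ~= 0.364, prec ~= 0.833, rec ~= 0.392, f1 ~= 0.533
```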

Here are the training and evaluation specs:
train_lidc.txt (1.2 KB)

test_isbi.txt (1.1 KB)

Here is a sample of my data:

Can you give me any tips to improve the performance?

You can set a larger input size (change 224 to 512) and use a larger backbone (e.g., fan_base).
Below is an example I ran with an older docker image, nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt. You can try it as well.

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash

$ segformer train -e /localhome/local-morganh/segformer/fanbase.yaml

$ cat fanbase.yaml

results_dir: /localhome/local-morganh/segformer/fanbase 
train: 
  num_gpus: 1 
  exp_config: 
      manual_seed: 49 
  checkpoint_interval: 200 
  logging_interval: 10 
  max_iters: 20000 #5000 #10000 #5000 
  resume_training_checkpoint_path: null 
  validate: True 
  validation_interval: 10 #200 #50 
  trainer: 
      find_unused_parameters: True 
      sf_optim: 
        lr: 0.00006 
evaluate: 
  checkpoint: /localhome/local-morganh/segformer/fanbase/train/iter_20000.pth
model: 
  input_height: 512
  input_width: 512
  pretrained_model_path: /localhome/local-morganh/segformer/fan_hybrid_base_in22k_1k_384.pth  #https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/pretrained_segformer_imagenet/files?version=fan_hybrid_base_in22k_1k_384 
  #pretrained_model_path: null 
  backbone: 
    type: "fan_base_16_p4_hybrid" 
dataset: 
  input_type: "grayscale" 
  img_norm_cfg: 
        mean: 
          - 127.5 
          - 127.5 
          - 127.5 
        std: 
          - 127.5 
          - 127.5 
          - 127.5 
        to_rgb: True 
  data_root: /tao-pt/tao-experiments 
  train_dataset: 
      img_dir: 
        - /localhome/local-morganh/segformer/data/image/train
      ann_dir: 
        - /localhome/local-morganh/segformer/data/mask/train
      pipeline: 
        augmentation_config: 
          random_crop: 
            #crop_size: 
            #  - 672 
            #  - 672 
            cat_max_ratio: 0.75 
          resize: 
            img_scale: 
              - 512 
              - 1024 
            ratio_range: 
              - 0.5 
              - 2.0 
          random_flip: 
            prob: 0.5
  val_dataset: 
      img_dir: 
        - /localhome/local-morganh/segformer/data/image/val
      ann_dir: 
        - /localhome/local-morganh/segformer/data/mask/val
  test_dataset: 
      img_dir: 
        - /localhome/local-morganh/segformer/data/image/test
      ann_dir: 
        - /localhome/local-morganh/segformer/data/mask/test
  palette: 
    - seg_class: background 
      rgb: 
        - 0 
        - 0 
        - 0 
      label_id: 0 
      mapping_class: background 
    - seg_class: foreground 
      rgb: 
        - 255 
        - 255 
        - 255 
      label_id: 1 
      mapping_class: foreground 
  repeat_data_times: 500 
  batch_size: 8 #4 #1 
  workers_per_gpu: 1 
export: 
  input_height: 512 
  input_width: 512 
  input_channel: 3 
  onnx_file: "${results_dir}/iter_500.onnx"
gen_trt_engine: 
  input_width: 512 
  input_height: 512 
  tensorrt: 
    data_type: FP32 
    workspace_size: 1024 
    min_batch_size: 1 
    opt_batch_size: 1 
    max_batch_size: 1 

Run evaluation
$ segformer evaluate -e /localhome/local-morganh/segformer/fanbase.yaml evaluate.checkpoint=/localhome/local-morganh/segformer/fanbase/train/iter_20000.pth

Run inference
$ segformer inference -e /localhome/local-morganh/segformer/fanbase.yaml inference.checkpoint=/localhome/local-morganh/segformer/fanbase/train/iter_20000.pth

I have a question regarding the fanbase.yaml spec: do I need to make any modifications to adapt it to my custom data? For example, my image size is 512x512 with just one channel.
In your spec (fanbase.yaml) I see settings like:

img_scale:
  - 512
  - 1024

img_norm_cfg:
  mean:
    - 127.5
    - 127.5
    - 127.5

palette:
  - seg_class: background
    rgb:
      - 0
      - 0
      - 0

export:
  input_height: 512
  input_width: 512
  input_channel: 3

gen_trt_engine:
  input_width: 512
  input_height: 512
My question, regarding the above configuration, is about the image size and number of channels. Do I need to make any changes?

My previous experiment was also run with 512x512 single-channel images, so you can take it as a reference.
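As a side note on `img_norm_cfg`: a mean and std of 127.5 simply rescales 8-bit pixel values into [-1, 1], and with `input_type: "grayscale"` plus `to_rgb: True` the single channel is replicated to three, which is why `input_channel: 3` still applies to single-channel data. A small numpy illustration (my own sketch, not TAO code):

```python
import numpy as np

# mean = std = 127.5 rescales 8-bit pixel values into [-1, 1]
gray = np.array([[0.0, 127.5, 255.0]], dtype=np.float32)  # 1x3 grayscale "image"
rgb = np.repeat(gray[..., None], 3, axis=-1)              # to_rgb: True replicates the channel
normed = (rgb - 127.5) / 127.5
# normed[0, :, 0] is now [-1.0, 0.0, 1.0], identical across the 3 channels
```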

I couldn't run your previous experiment due to GPU incompatibility:

docker run --gpus all -it --rm \
    -u $(id -u):$(id -g) \
    -v /home/cvig/CVIG/Devel/tao_experiments_segformer_ccg:/workspace/tao_experiments \
    nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash

===========================

=== TAO Toolkit PyTorch ===

NVIDIA Release 5.5.0-PyT (build 88113656)
TAO Toolkit Version 5.5.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:

WARNING: Detected NVIDIA RTX PRO 4000 Blackwell Generation Laptop GPU GPU, which is not yet supported in this version of the container
ERROR: No supported GPU(s) detected to run this container

However, I managed to get a better result using a larger model and image resolution (512x512):

model_epoch_109_step_38060.pth
Testing DataLoader 0: 100%|██████████| 67/67 [00:20<00:00, 3.28it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ F1_0 │ 0.9996967315673828 │
│ F1_1 │ 0.7115785479545593 │
│ acc │ 0.9993942379951477 │
│ iou_0 │ 0.9993938207626343 │
│ iou_1 │ 0.5522871613502502 │
│ mf1 │ 0.8556376695632935 │
│ miou │ 0.7758404612541199 │
│ mprecision │ 0.9048025012016296 │
│ mrecall │ 0.8171431422233582 │
│ precision_0 │ 0.9995691180229187 │
│ precision_1 │ 0.8100359439849854 │
│ recall_0 │ 0.999824583530426 │
│ recall_1 │ 0.6344617605209351 │
└───────────────────────────┴───────────────────────────┘

Here is the spec: train_lidc.txt (1.2 KB)

@Morganh, do you know if there is any parameter that can mitigate the problem of class imbalance (foreground vs. background)?

Can we apply zoom (augmentation) to improve detection of small objects?
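For background on the imbalance question: a common mitigation is to weight the per-pixel cross-entropy by inverse class frequency, so the rare foreground class contributes more to the loss. The numpy sketch below (all numbers made up; this is the general idea, not TAO's implementation) shows how up-weighting the foreground raises the penalty for missed tumor pixels:

```python
import numpy as np

def weighted_ce(probs, labels, class_weights):
    """Per-pixel cross-entropy where each class carries a weight;
    up-weighting the rare class counteracts imbalance."""
    w = class_weights[labels]                              # weight per pixel
    ce = -np.log(probs[np.arange(labels.size), labels])    # CE of the true class
    return (w * ce).sum() / w.sum()

# Hypothetical 6-pixel image: 5 background (class 0), 1 foreground (class 1)
probs = np.array([[0.9, 0.1]] * 5 + [[0.4, 0.6]])          # predicted class probs
labels = np.array([0, 0, 0, 0, 0, 1])
# inverse-frequency weights: background 1x, foreground 5x
loss = weighted_ce(probs, labels, np.array([1.0, 5.0]))
```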

OK, the NVIDIA RTX PRO 4000 (Blackwell) is compatible with the TAO 6.x docker, not the TAO 5.5 docker.

Glad to know the result is better now. So you are still running with the TAO 6.0 docker instead of the TAO 5.5 docker, right? Just to confirm, since your latest training yaml is in TAO 6's format.

Yes, I am still running with the TAO 6.0 docker because of the error when I try to run with the TAO 5.5 docker: No supported GPU(s) detected to run this container.

Please continue to do experiments inside TAO 6.0 docker.

Please try:
exp1: Disable random_color.
exp2: Use nvdino_v2. An example,

results_dir: ./project/segformer_cradio_nopretrain_0817

train:
  resume_training_checkpoint_path: null
  segment:
    loss: "ce"
  num_epochs: 50
  num_nodes: 1
  validation_interval: 1
  checkpoint_interval: 10
  optim:
    lr: 0.00006
    optim: "adamw"
    policy: "linear"
    weight_decay: 0.0005

evaluate:
  num_gpus: 1
  gpu_ids: [1]
  num_nodes: 1
  checkpoint: /localhome/local-morganh/bak_two_A40_ipp1_0165/bakup/bak-morganh/a1u1g-mil-0485_segformer_xxx_2nd/segformer/project/segformer_cradio_nopretrain_0817/train/segformer_model_latest.pth
  results_dir: ${results_dir}/evaluate
  vis_after_n_batches: 1
  batch_size: 1

model:
  backbone:
    type: "vit_large_nvdinov2"
    #type: "c_radio_v2_vit_base_patch16_224"
    #type: "c_radio_v3_vit_large_patch16_reg4_dinov2"
    #type: "c_radio_v2_vit_large_patch16_224"
    #type: "fan_large_16_p4_hybrid"
    pretrained_backbone_path: null
    #pretrained_backbone_path: ./cradiov2_vcradiov2-b/c_radio_v2_b.ckpt
    freeze_backbone: False
  #decode_head:
  #  feature_strides: [4, 8, 16, 32]

dataset:
  segment:
    dataset: "SFDataset"
    root_dir: /localhome/local-morganh/bak_two_A40_ipp1_0165/bakup/bak-morganh/a1u1g-mil-0485_segformer_xxx_2nd/segformer/project_dataset/SegFormer_DINOv2_data_0_255_crop
    label_transform: "norm"
    batch_size: 4
    workers: 4
    num_classes: 2
    img_size: 672
    train_split: "train"
    validation_split: "val"
    test_split: 'val'
    predict_split: 'test'
    augmentation:
      random_flip:
        vflip_probability: 0.5
        hflip_probability: 0.5
        enable: True
      random_rotate:
        rotate_probability: 0.5
        angle_list: [90, 180, 270]
        enable: True
      random_color:
        brightness: 0.3
        contrast: 0.3
        saturation: 0.3
        hue: 0.3
        enable: False
      with_scale_random_crop:
        enable: True
      with_random_crop: True
      with_random_blur: False

exp3: use c_radio. An example,

results_dir: ./project/segformer_cradio_nopretrain_0818


train:
  resume_training_checkpoint_path: null
  segment:
    loss: "ce"
  num_epochs: 50
  num_nodes: 1
  validation_interval: 1
  checkpoint_interval: 10
  optim:
    lr: 0.00006
    optim: "adamw"
    policy: "linear"
    weight_decay: 0.0005

evaluate:
  num_gpus: 1
  gpu_ids: [1]
  num_nodes: 1
  checkpoint: /localhome/local-morganh/bak_two_A40_ipp1_0165/bakup/bak-morganh/a1u1g-mil-0485_segformer_xxx_2nd/segformer/project/segformer_cradio_nopretrain_0818/train/segformer_model_latest.pth
  results_dir: ${results_dir}/evaluate
  vis_after_n_batches: 1
  batch_size: 1

model:
  backbone:
    #type: "vit_large_nvdinov2"
    type: "c_radio_v2_vit_base_patch16_224"
    #type: "c_radio_v3_vit_large_patch16_reg4_dinov2"
    #type: "c_radio_v2_vit_large_patch16_224"
    #type: "fan_large_16_p4_hybrid"
    pretrained_backbone_path: null
    #pretrained_backbone_path: ./cradiov2_vcradiov2-b/c_radio_v2_b.ckpt
    freeze_backbone: False
  #decode_head:
  #  feature_strides: [4, 8, 16, 32]

dataset:
  segment:
    dataset: "SFDataset"
    root_dir: /localhome/local-morganh/bak_two_A40_ipp1_0165/bakup/bak-morganh/a1u1g-mil-0485_segformer_xxx_2nd/segformer/project_dataset/SegFormer_DINOv2_data_0_255_crop
    label_transform: "norm"
    batch_size: 4
    workers: 4
    num_classes: 2
    img_size: 224
    train_split: "train"
    validation_split: "val"
    test_split: 'val'
    predict_split: 'test'
    augmentation:
      random_flip:
        vflip_probability: 0.5
        hflip_probability: 0.5
        enable: True
      random_rotate:
        rotate_probability: 0.5
        angle_list: [90, 180, 270]
        enable: True
      random_color:
        brightness: 0.3
        contrast: 0.3
        saturation: 0.3
        hue: 0.3
        enable: False
      with_scale_random_crop:
        enable: True
      with_random_crop: True
      with_random_blur: False

@Morganh, thanks for your reply. Where can I find the pre-trained weights for nvdino_v2 and c_radio?

You can try to train from scratch first, then try to use the pretrained models.
For c_radio, the models are at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/cradiov2/files
For nvdino_v2, the models are at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/imagenet_nv_dinov2?version=trainable_v1.1

I have trained c_radio_v2 and nvdino_v2 from scratch but the performance is bad. When trying to fine-tune nvdino_v2, the following error occurred:

Do ViT pretrained backbone interpolation
Error executing job with overrides: ['results_dir=/results/isbi_experiment', 'train.num_gpus=1']
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 72, in _func
    raise e
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 51, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 94, in main
    run_experiment(
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 60, in run_experiment
    model = SegFormerPlModel(experiment_config)
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/model/segformer_pl_model.py", line 56, in __init__
    self._build_model(export)
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/model/segformer_pl_model.py", line 94, in _build_model
    self.model = build_model(experiment_config=self.experiment_spec, export=export)
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/model/segformer.py", line 221, in build_model
    model = SegFormer(
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/model/segformer.py", line 124, in __init__
    self.backbone = vit_adapter_model_dict[self.model_name](
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/model/backbones/nvdinov2.py", line 85, in __init__
    pretrained_backbone_ckp = interpolate_vit_checkpoint(checkpoint=pretrained_backbone_ckp,
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/model/backbones/nvdinov2.py", line 185, in interpolate_vit_checkpoint
    checkpoint = interpolate_patch_embed(checkpoint=checkpoint, new_patch_size=target_patch_size)
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/core/utils/pos_embed_interpolation.py", line 87, in interpolate_patch_embed
    patch_embed = checkpoint['patch_embed.proj.weight']
KeyError: 'patch_embed.proj.weight'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
2026-01-19 12:10:35,464 [TAO Toolkit] [WARNING] root 339: Telemetry data couldn't be sent, but the command ran successfully.
2026-01-19 12:10:35,464 [TAO Toolkit] [WARNING] root 342: [Error]: 'str' object has no attribute 'decode'
2026-01-19 12:10:35,464 [TAO Toolkit] [WARNING] root 346: Execution status: FAIL
2026-01-19 12:10:36,142 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 371: Stopping container.

Here is my spec:

nvdino_v2.txt (1.6 KB)
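For what it's worth, the traceback shows the interpolation step expects the key `patch_embed.proj.weight` at the top level of the loaded checkpoint dict; a `KeyError` there often means the weights are nested under a wrapper key. A hedged pure-Python sketch for locating them (the wrapper names are assumptions; in practice load the file with `torch.load` and inspect `.keys()`):

```python
def find_vit_weights(ckpt, probe="patch_embed.proj.weight"):
    """Return the dict that actually holds the ViT weights, unwrapping
    common (assumed) wrapper keys, or None if the probe key is absent."""
    if probe in ckpt:
        return ckpt
    for wrapper in ("state_dict", "model", "module"):
        inner = ckpt.get(wrapper)
        if isinstance(inner, dict) and probe in inner:
            return inner
    return None

# Toy checkpoint with weights nested under "state_dict"
toy = {"state_dict": {"patch_embed.proj.weight": "tensor..."}}
weights = find_vit_weights(toy)   # returns the inner dict
```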

Please check the c_radio one first. Thanks!

Results for the c_radio_v2 from scratch:

model_epoch_019_step_06920.pth
Testing DataLoader 0: 100%|██████████| 67/67 [00:04<00:00, 13.90it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ F1_0 │ 0.9995049238204956 │
│ F1_1 │ 0.3475402891635895 │
│ acc │ 0.9990107417106628 │
│ iou_0 │ 0.999010443687439 │
│ iou_1 │ 0.2103169858455658 │
│ mf1 │ 0.6735225915908813 │
│ miou │ 0.6046637296676636 │
│ mprecision │ 0.8852398991584778 │
│ mrecall │ 0.6121095418930054 │
│ precision_0 │ 0.99908846616745 │
│ precision_1 │ 0.7713912725448608 │
│ recall_0 │ 0.9999217987060547 │
│ recall_1 │ 0.22429728507995605 │
└───────────────────────────┴───────────────────────────┘

See results for other epochs:

Model_evaluate-from_scratch.txt (15.0 KB)

The spec for the training:

c_radio_v2_b.txt (1.4 KB)

Results for the c_radio_v2 using fine-tuning:

model_epoch_059_step_20760.pth
Testing DataLoader 0: 100%|██████████| 67/67 [00:07<00:00, 9.02it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ F1_0 │ 0.9995514750480652 │
│ F1_1 │ 0.4460074007511139 │
│ acc │ 0.999103844165802 │
│ iou_0 │ 0.9991034865379333 │
│ iou_1 │ 0.2870074510574341 │
│ mf1 │ 0.7227794528007507 │
│ miou │ 0.6430554986000061 │
│ mprecision │ 0.9067608714103699 │
│ mrecall │ 0.653510332107544 │
│ precision_0 │ 0.9991857409477234 │
│ precision_1 │ 0.8143360614776611 │
│ recall_0 │ 0.9999176263809204 │
│ recall_1 │ 0.30710306763648987 │
└───────────────────────────┴───────────────────────────┘

See results for other epochs:

Model_evaluate-fine_tuning.txt (15.0 KB)

The spec for the training:

c_radio_v2.txt (1.4 KB)

May I know how many images you trained on? The result so far is better than the result of training from scratch.
Could you share the full training log? Please check how the loss changes.

I only have the status log for training from scratch for now:

status_train_from_scratch.txt (97.4 KB)

My dataset has 1383 images.
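(As a consistency check, the checkpoint step counts above match this dataset size at batch size 4:)

```python
import math

images, batch_size = 1383, 4
steps_per_epoch = math.ceil(images / batch_size)   # 346 steps per epoch
print(steps_per_epoch * 20)   # 6920  -> model_epoch_019_step_06920
print(steps_per_epoch * 60)   # 20760 -> model_epoch_059_step_20760
```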

Yes, training by fine-tuning is better. I think that if I could train with an image resolution of 512x512 the result would be even better, but when I try to train at this resolution, it raises an error.