SegFormer fine-tuning

Is it possible to share a small part of the dataset with me for further checking? If yes, you can send it to me via a private message.

As we synced offline, since fan_base is better as of now, could you please use a larger one, i.e., fan_large_16_p4_hybrid, to trigger training? Thanks.

Disable random_color + Use fan_large_16_p4_hybrid

Hi,
I copy your latest result here.
status.txt (65.9 KB)

experiment.txt (3.4 KB)

So, the larger backbone does improve results: val_miou = 0.78.

What is the F1_1 now?

Hi @Morganh , thanks for your reply.

Here is the complementary performance information:

Testing DataLoader 0: 100%|██████████| 67/67 [00:21<00:00, 3.13it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ F1_0 │ 0.9997057914733887 │
│ F1_1 │ 0.7192901968955994 │
│ acc │ 0.9994122385978699 │
│ iou_0 │ 0.9994118213653564 │
│ iou_1 │ 0.5616340637207031 │
│ mf1 │ 0.8594979643821716 │
│ miou │ 0.7805229425430298 │
│ mprecision │ 0.9108108282089233 │
│ mrecall │ 0.8196027278900146 │
│ precision_0 │ 0.9995748996734619 │
│ precision_1 │ 0.8220468163490295 │
│ recall_0 │ 0.999836802482605 │
│ recall_1 │ 0.6393686532974243 │
└───────────────────────────┴───────────────────────────┘
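As a sanity check on the numbers above: per-class F1 and IoU are tied to precision and recall by F1 = 2PR/(P+R), and for binary masks IoU = F1/(2 − F1). A minimal sketch using the class-1 row of the table (function names are illustrative, not TAO internals):

```python
def f1_from_pr(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def iou_from_f1(f1):
    """For a binary mask, IoU (Jaccard) relates to F1 (Dice) as IoU = F1 / (2 - F1)."""
    return f1 / (2 - f1)

# precision_1 and recall_1 values copied from the table above.
f1_1 = f1_from_pr(0.8220468163490295, 0.6393686532974243)
iou_1 = iou_from_f1(f1_1)
print(round(f1_1, 4), round(iou_1, 4))  # 0.7193 0.5616, matching F1_1 and iou_1
```

This confirms the reported F1_1 ≈ 0.7193 and iou_1 ≈ 0.5616 are internally consistent.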

For your case, please try disabling some of the augmentations and trigger training again.

    augmentation:
      random_flip:
        vflip_probability: 0.5
        hflip_probability: 0.5
        enable: False #True   
      random_rotate:
        rotate_probability: 0.5
        angle_list: [90, 180, 270]
        enable: False #True 
      random_color:
        brightness: 0.3
        contrast: 0.3
        saturation: 0.3
        hue: 0.3
        enable: False
      with_scale_random_crop:
        enable: False #True   
      with_random_crop: False #True  
      with_random_blur: False
    label_transform: "norm"
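For reference, each `enable` flag above gates its transform independently of the probability fields. A rough pure-Python sketch of how such gating typically works (`apply_augmentations`, `hflip`, and the dict layout are illustrative, not the actual TAO implementation):

```python
import random

def hflip(image):
    """Horizontally flip an image stored as a list of rows."""
    return [row[::-1] for row in image]

def apply_augmentations(image, cfg, rng=random):
    """Apply a transform only when its 'enable' flag is True AND its
    probability check passes (illustrative sketch of the gating logic)."""
    flip_cfg = cfg["random_flip"]
    if flip_cfg["enable"] and rng.random() < flip_cfg["hflip_probability"]:
        image = hflip(image)
    return image

cfg = {"random_flip": {"enable": False, "hflip_probability": 0.5}}
img = [[1, 2], [3, 4]]
# With enable=False the image passes through unchanged, regardless of probability.
print(apply_augmentations(img, cfg))  # [[1, 2], [3, 4]]
```

The point of setting `enable: False` as above is exactly this: the probability fields become irrelevant and the transform is never applied.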

BTW, on my side, I just used the dataset (10 images and masks) you shared to run c-radio from scratch. The train_loss_epoch keeps decreasing in the log and F1_1 keeps increasing. You can extend num_epochs, along with the augmentation settings above, and run training against 1) c-radio trained from scratch and 2) fan-large training.

When trying this experiment, the following error occurred:

Epoch 24:  51%|█████▏    | 355/692 [01:17<01:14,  4.55it/s, v_num=1, train_loss_step=0.000232, train_loss_epoch=0.00028, val_loss=0.00695, val_acc=0.999]
ERROR: Unexpected segmentation fault encountered in worker.
Epoch 24:  52%|█████▏    | 359/692 [01:18<01:13,  4.55it/s, v_num=1, train_loss_step=0.000132, train_loss_epoch=0.00028, val_loss=0.00695, val_acc=0.999]
Error executing job with overrides: ['results_dir=/results/isbi_experiment', 'train.num_gpus=1']
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1243, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 525, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 953, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 395, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 72, in _func
    raise e
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 51, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 94, in main
    run_experiment(
  File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 78, in run_experiment
    trainer.fit(model, dm, ckpt_path=resume_ckpt)
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 982, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 1026, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/fit_loop.py", line 216, in run
    self.advance()
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/fit_loop.py", line 455, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 150, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 282, in advance
    batch, _, __ = next(data_fetcher)
                   ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/fetchers.py", line 134, in __next__
    batch = super().__next__()
            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/fetchers.py", line 61, in __next__
    batch = next(self.iterator)
            ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/utilities/combined_loader.py", line 78, in __next__
    out[i] = next(self.iterators[i])
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1448, in _next_data
    idx, data = self._get_data()
                ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1412, in _get_data
    success, data = self._try_get_data()
                    ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1271, in _try_get_data
    [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/tempfile.py", line 721, in NamedTemporaryFile
    file = _io.open(dir, mode, buffering=buffering,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/tempfile.py", line 716, in opener
    def opener(*args):
    
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 911) is killed by signal: Segmentation fault. 

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Epoch 24:  52%|█████▏    | 359/692 [01:23<01:17,  4.31it/s, v_num=1, train_loss_step=0.000132, train_loss_epoch=0.00028, val_loss=0.00695, val_acc=0.999]2026-02-13 10:01:16,874 [TAO Toolkit] [WARNING] root 339: Telemetry data couldn't be sent, but the command ran successfully.
2026-02-13 10:01:16,874 [TAO Toolkit] [WARNING] root 342: [Error]: 'str' object has no attribute 'decode'
2026-02-13 10:01:16,874 [TAO Toolkit] [WARNING] root 346: Execution status: FAIL
2026-02-13 10:01:17,530 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 371: Stopping container.

Here is the spec file:

exp_fan_large.txt (3.4 KB)

It seems to be an OOM (out of memory) issue. Please set a lower number of workers and retry.
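Besides lowering the workers setting in the spec, the traceback (the `tempfile` / `fds_limit_margin` frames inside `_try_get_data`) hints that the dataloader workers ran out of open file descriptors while sharing tensors between processes. A complementary stdlib-only workaround is to raise the per-process open-file soft limit before training starts; whether this helps depends on your container's ulimit configuration:

```python
import resource

# Inspect the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit; the soft limit is what
# DataLoader workers hit when passing tensors over file descriptors.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Another commonly cited workaround for this class of failure is `torch.multiprocessing.set_sharing_strategy("file_system")`, which avoids FD-based tensor sharing entirely; I have not verified it inside the TAO container, so treat it as a suggestion rather than a confirmed fix.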

Here are the results after solving the OOM by reducing the number of workers:

The result is worse. According to the training vs. validation loss, the model is overfitting, probably because the model is large and we disabled the data augmentation:
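One way to make the overfitting call concrete from the logs: flag the epoch where validation loss starts climbing while training loss keeps falling. A small illustrative helper (not part of TAO; the loss lists would be parsed from status.txt):

```python
def overfit_epoch(train_losses, val_losses, patience=3):
    """Return the first epoch index after which val loss rose for `patience`
    consecutive epochs while train loss kept falling, or None if no such point."""
    for i in range(len(val_losses) - patience):
        val_rising = all(val_losses[j + 1] > val_losses[j] for j in range(i, i + patience))
        train_falling = all(train_losses[j + 1] < train_losses[j] for j in range(i, i + patience))
        if val_rising and train_falling:
            return i
    return None

# Synthetic example: train loss keeps dropping, val loss turns around at epoch 2.
train = [0.9, 0.5, 0.3, 0.2, 0.1, 0.05]
val = [0.8, 0.6, 0.5, 0.55, 0.6, 0.7]
print(overfit_epoch(train, val))  # 2
```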

Here are the training and evaluation results:

fan_large_evaluate.txt (15.0 KB)

training_log.txt (937 Bytes)

Did you ever run experiment 1 (c-radio training from scratch)? As mentioned above, I just used the dataset (10 images and masks) you shared to run c-radio from scratch. The train_loss_epoch keeps decreasing in the log and F1_1 keeps increasing.

Yes, but no success. Here are the experiment spec and log for the same 10 images that I shared with you:

status.txt (195.0 KB)

experiment.txt (3.4 KB)

Let's try to reproduce your experiment that achieved good performance. Could you please share the complete spec of your experiment?

Yes, here is the spec file I ran against your 10 images.
20260211_forum_356578_segformer_train_spec.txt (6.4 KB)

Here is the training log. BTW, I resumed training several times to confirm the loss and F1_1 status. While resuming, the only change was to set a larger num_epochs. In my first training run, num_epochs was 100. In the end, after several resumes, the latest num_epochs is 300, as mentioned in the spec file shared above.

20260211_forum_356578_segformer_train_logbak.txt (170.5 KB)
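For anyone reproducing the resume flow: schematically, each resume keeps the spec unchanged except for a larger `num_epochs` plus a pointer to the last checkpoint. A hedged fragment (the field names follow common TAO PyTorch spec conventions and the checkpoint path is purely illustrative; check the shared spec file and the TAO docs for the exact keys in your version):

```yaml
train:
  num_epochs: 300            # raised from 100 across successive resumes
  # illustrative path; point this at your own latest checkpoint
  resume_training_checkpoint_path: /results/isbi_experiment/train/segformer_model_latest.pth
```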

@Morganh, could you please reproduce this experiment using toolkit_version: 6.0.0?

Hi @eduardo.assuncao1 ,
I was using docker run --runtime=nvidia -it --rm -v /localhome/local-morganh:/localhome/local-morganh nvcr.io/nvidia/tao/tao-toolkit:6.25.11-pyt /bin/bash. It is the latest version of TAO 6.