Is it possible to share a small part of the dataset with me for further checking? If yes, you can send it to me via private message.
As we synced offline, since fan_base performs better as of now, could you please use a larger backbone, i.e., fan_large_16_p4_hybrid, to trigger training? Thanks.
Disable random_color + Use fan_large_16_p4_hybrid
Hi,
I copied your latest result here.
status.txt (65.9 KB)
experiment.txt (3.4 KB)
So, the larger backbone can improve the result: val_miou = 0.78
What is the F1_1 now?
Hi @Morganh , thanks for your reply.
Here is the complementary performance information:
Testing DataLoader 0: 100%|██████████| 67/67 [00:21<00:00, 3.13it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric               ┃ DataLoader 0              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ F1_0                      │ 0.9997057914733887        │
│ F1_1                      │ 0.7192901968955994        │
│ acc                       │ 0.9994122385978699        │
│ iou_0                     │ 0.9994118213653564        │
│ iou_1                     │ 0.5616340637207031        │
│ mf1                       │ 0.8594979643821716        │
│ miou                      │ 0.7805229425430298        │
│ mprecision                │ 0.9108108282089233        │
│ mrecall                   │ 0.8196027278900146        │
│ precision_0               │ 0.9995748996734619        │
│ precision_1               │ 0.8220468163490295        │
│ recall_0                  │ 0.999836802482605         │
│ recall_1                  │ 0.6393686532974243        │
└───────────────────────────┴───────────────────────────┘
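As a quick sanity check (a sketch for verification only, not part of the TAO tooling), the reported per-class F1 and mean IoU are internally consistent: F1 = 2PR/(P+R) from the class-1 precision/recall, and miou is the mean of iou_0 and iou_1.

```python
# Values copied from the metrics table above.
precision_1 = 0.8220468163490295
recall_1 = 0.6393686532974243

# Harmonic mean of precision and recall for the change class.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
print(f1_1)  # ~0.7193, matching the reported F1_1

iou_0 = 0.9994118213653564
iou_1 = 0.5616340637207031
miou = (iou_0 + iou_1) / 2
print(miou)  # ~0.7805, matching the reported miou
```

The gap between the near-perfect background metrics (F1_0, iou_0) and the much lower class-1 metrics reflects the heavy class imbalance in this dataset.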
For your case, please try disabling some of the augmentations and trigger a run:
augmentation:
  random_flip:
    vflip_probability: 0.5
    hflip_probability: 0.5
    enable: False #True
  random_rotate:
    rotate_probability: 0.5
    angle_list: [90, 180, 270]
    enable: False #True
  random_color:
    brightness: 0.3
    contrast: 0.3
    saturation: 0.3
    hue: 0.3
    enable: False
  with_scale_random_crop:
    enable: False #True
  with_random_crop: False #True
  with_random_blur: False
  label_transform: "norm"
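To double-check before launching that every augmentation in the spec is actually off, a minimal sketch (it mirrors the augmentation section above as a plain Python dict; the helper `all_disabled` is hypothetical, not part of TAO):

```python
# Mirror of the augmentation spec above as a plain dict (illustrative only).
augmentation = {
    "random_flip": {"vflip_probability": 0.5, "hflip_probability": 0.5, "enable": False},
    "random_rotate": {"rotate_probability": 0.5, "angle_list": [90, 180, 270], "enable": False},
    "random_color": {"brightness": 0.3, "contrast": 0.3, "saturation": 0.3, "hue": 0.3, "enable": False},
    "with_scale_random_crop": {"enable": False},
    "with_random_crop": False,
    "with_random_blur": False,
    "label_transform": "norm",
}

def all_disabled(aug: dict) -> bool:
    """Return True if every 'enable' flag and boolean toggle is off."""
    for key, val in aug.items():
        if isinstance(val, dict):
            if val.get("enable", False):
                return False
        elif isinstance(val, bool) and val:
            return False
    return True

print(all_disabled(augmentation))  # True
```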
BTW, on my side, I just used the dataset (10 images and masks) you shared to run c-radio from scratch. The train_loss_epoch keeps decreasing in the log and F1_1 keeps getting larger. You can extend num_epochs along with the above augmentation settings and run training against: 1) c-radio from-scratch training, 2) fan-large training.
While trying this experiment, the following error occurred:
Epoch 24: 51%|█████▏ | 355/692 [01:17<01:14, 4.55it/s, v_num=1, train_loss_step=0.000232, train_loss_epoch=0.00028, val_loss=0.00695, val_acc=0.999]
ERROR: Unexpected segmentation fault encountered in worker.
Epoch 24: 52%|█████▏ | 359/692 [01:18<01:13, 4.55it/s, v_num=1, train_loss_step=0.000132, train_loss_epoch=0.00028, val_loss=0.00695, val_acc=0.999]
Error executing job with overrides: ['results_dir=/results/isbi_experiment', 'train.num_gpus=1']
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1243, in _try_get_data
data = self._data_queue.get(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
fd = df.detach()
^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 525, in Client
answer_challenge(c, authkey)
File "/usr/lib/python3.12/multiprocessing/connection.py", line 953, in answer_challenge
message = connection.recv_bytes(256) # reject large message
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 395, in _recv
chunk = read(handle, remaining)
^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 72, in _func
raise e
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 51, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 94, in main
run_experiment(
File "/usr/local/lib/python3.12/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 78, in run_experiment
trainer.fit(model, dm, ckpt_path=resume_ckpt)
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 575, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 982, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/trainer/trainer.py", line 1026, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/fit_loop.py", line 216, in run
self.advance()
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/fit_loop.py", line 455, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 150, in run
self.advance(data_fetcher)
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 282, in advance
batch, _, __ = next(data_fetcher)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/fetchers.py", line 134, in __next__
batch = super().__next__()
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/loops/fetchers.py", line 61, in __next__
batch = next(self.iterator)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pytorch_lightning/utilities/combined_loader.py", line 78, in __next__
out[i] = next(self.iterators[i])
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 701, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1448, in _next_data
idx, data = self._get_data()
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1412, in _get_data
success, data = self._try_get_data()
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1271, in _try_get_data
[tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/tempfile.py", line 721, in NamedTemporaryFile
file = _io.open(dir, mode, buffering=buffering,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/tempfile.py", line 716, in opener
def opener(*args):
File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 911) is killed by signal: Segmentation fault.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 24: 52%|█████▏ | 359/692 [01:23<01:17, 4.31it/s, v_num=1, train_loss_step=0.000132, train_loss_epoch=0.00028, val_loss=0.00695, val_acc=0.999]
2026-02-13 10:01:16,874 [TAO Toolkit] [WARNING] root 339: Telemetry data couldn't be sent, but the command ran successfully.
2026-02-13 10:01:16,874 [TAO Toolkit] [WARNING] root 342: [Error]: 'str' object has no attribute 'decode'
2026-02-13 10:01:16,874 [TAO Toolkit] [WARNING] root 346: Execution status: FAIL
2026-02-13 10:01:17,530 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 371: Stopping container.
Here is the spec file:
exp_fan_large.txt (3.4 KB)
It seems to be OOM (out of memory). Please set a lower number of workers and retry.
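DataLoader worker segfaults like the one above often go away with fewer worker processes. A rough heuristic sketch for picking a conservative value (the cap of 2 is an assumption for this small 10-image dataset, not a TAO recommendation; the actual setting lives in the spec file's dataset section):

```python
import os

def suggest_num_workers(cap: int = 2) -> int:
    """Pick a conservative DataLoader worker count.

    Leaves one CPU for the main process and caps the total; with a
    tiny dataset (10 images here), few workers are needed anyway.
    """
    cpus = os.cpu_count() or 1
    return max(0, min(cap, cpus - 1))

print(suggest_num_workers())
```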
Here are the results after solving the OOM by reducing the number of workers:
The result is worse. Judging by the train vs. val loss, the model is overfitting, probably because the model is large and we disabled the data augmentation.
Here are the training and evaluation performance:
fan_large_evaluate.txt (15.0 KB)
training_log.txt (937 Bytes)
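The overfitting diagnosis above (train loss still falling while val loss rises) can be checked mechanically from the logged curves. A rough sketch with made-up loss values, not the actual training log:

```python
def looks_overfit(train_losses, val_losses, window: int = 3) -> bool:
    """Heuristic: train loss still decreasing while val loss
    increases over the last `window` epochs."""
    t, v = train_losses[-window:], val_losses[-window:]
    train_falling = all(b < a for a, b in zip(t, t[1:]))
    val_rising = all(b > a for a, b in zip(v, v[1:]))
    return train_falling and val_rising

# Illustrative curves only (not the real logs): val loss turns up
# while train loss keeps dropping -> classic overfitting signature.
train = [0.9, 0.5, 0.3, 0.2, 0.1]
val = [0.8, 0.6, 0.5, 0.55, 0.6]
print(looks_overfit(train, val))  # True
```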
Did you ever run experiment 1 (c-radio training from scratch)? As mentioned above, I just used the dataset (10 images and masks) you shared to run c-radio from scratch. The train_loss_epoch keeps decreasing in the log and F1_1 keeps getting larger.
Yes, but without success. Here are the experiment spec and log for the same 10 images that I shared with you:
status.txt (195.0 KB)
experiment.txt (3.4 KB)
Let's try to reproduce your experiment that achieved good performance. Could you please share the complete spec of your experiment?
Yes, here is the spec file I ran against your 10 images.
20260211_forum_356578_segformer_train_spec.txt (6.4 KB)
Here is the training log. BTW, I resumed training several times to confirm the loss and F1_1 status. While resuming, the only change was to set a larger num_epochs. In my first training, num_epochs was 100. In the end, after resuming several times, the latest num_epochs was 300, as mentioned in the spec file shared above.
@Morganh, could you please reproduce this experiment using toolkit_version: 6.0.0?
Hi @eduardo.assuncao1 ,
I was using docker run --runtime=nvidia -it --rm -v /localhome/local-morganh:/localhome/local-morganh nvcr.io/nvidia/tao/tao-toolkit:6.25.11-pyt /bin/bash. It is the latest version of TAO 6.
