Migrating TAO3 unet model to segformer, **Foreground has performance of 0.0!**

Please provide the following information when requesting support.

• Hardware: RTX 3090
• Network Type: unet / segformer

I have an existing unet model that has this performance:

"{'foreground': {'precision': 0.9996697, 'Recall': 0.9997607, 'F1 Score': 0.999715147331038, 'iou': 0.99943054},
  'background': {'precision': 0.6680363, 'Recall': 0.59312135, 'F1 Score': 0.6283537780304688, 'iou': 0.45810193}}"

This is the spec file for unet:

unet_train_resnet_6S300.txt (1.3 KB)

Using the same dataset of 1280 × 704 grayscale images, with mask values of either 0 or 255, I want to see whether performance can be improved by training a segformer with this spec file (a modified version of the isbi example):

train_isbi.yaml (1.4 KB)

Training with the segformer notebook, the ground-truth masks are visualized properly, but the training result is very bad: there is no detection at all of the foreground class!


2023-02-17 00:07:52,630 - mmseg - INFO - per class results:
2023-02-17 00:07:52,631 - mmseg - INFO - 
+------------+-------+-------+
| Class      | IoU   | Acc   |
+------------+-------+-------+
| background | 99.92 | 100.0 |
| foreground | 0.0   | 0.0   |
+------------+-------+-------+
2023-02-17 00:07:52,631 - mmseg - INFO - Summary:
2023-02-17 00:07:52,631 - mmseg - INFO - 
+--------+-------+------+-------+
| Scope  | mIoU  | mAcc | aAcc  |
+--------+-------+------+-------+
| global | 49.96 | 50.0 | 99.92 |
+--------+-------+------+-------+
2023-02-17 00:07:52,694 - mmseg - INFO - Iter(val) [1000]	mIoU: 0.4996, mAcc: 0.5000, aAcc: 0.9992
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: PASS
2023-02-17 02:07:59,748 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The complete training log is here:
segformer training output 2023 02 17.txt (18.8 KB)

Foreground has performance of 0.0 !

Thanks for the help

Dave

I also tried with a modified spec file, with the same result:

train_isbi.yaml (1.4 KB)

May I know if you see a similar issue when you run the default segformer notebook without any changes?

It runs to completion without problems and gives the following results on evaluate:


+------------+-------+-------+
| Class      | IoU   | Acc   |
+------------+-------+-------+
| foreground | 46.34 | 54.61 |
| background | 85.72 | 95.51 |
+------------+-------+-------+

In both scenarios the images are grayscale and the mask values are either 0 or 255. The only difference is the image size, which in my case is 1280 × 704, as required by unet.

Thanks!

David

See Data Annotation Format
For grayscale input images, the mask is a single channel image with size equal to the input image. Every pixel has a value of 255 or 0, which corresponds respectively to a label_id of 1 or 0.

Can you double-check the mask images' pixel values and also check whether they are set correctly in the yaml file?

Also, from the log you are running with 500 images. Can you run evaluation against the training images to narrow this down?

This I know.

Verified. I wrote a program that counts all the pixels in the 500 mask images:

Number of pixels with value 0: 450167723
Number of pixels with value 255: 392277
Number of pixels with other values: 0
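
For reference, a minimal sketch of that pixel-count check (the mask directory, the PNG extension, and the use of PIL/NumPy here are assumptions; the actual script was not posted):

# Sketch of the mask pixel-value check; directory, file extension, and the
# PIL/NumPy approach are assumptions, not the exact program that was run.
import glob

import numpy as np
from PIL import Image

counts = {0: 0, 255: 0, "other": 0}
for path in glob.glob("/data/masks/train/*.png"):
    mask = np.array(Image.open(path).convert("L"))  # 8-bit grayscale mask
    counts[0] += int(np.sum(mask == 0))
    counts[255] += int(np.sum(mask == 255))
    counts["other"] += int(np.sum((mask != 0) & (mask != 255)))

print("Number of pixels with value 0:", counts[0])
print("Number of pixels with value 255:", counts[255])
print("Number of pixels with other values:", counts["other"])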

And in the yaml spec file,


  palette:
    - seg_class: background
      rgb:
        - 0
        - 0
        - 0
      label_id: 0
      mapping_class: background
    - seg_class: foreground
      rgb:
        - 255
        - 255
        - 255
      label_id: 1
      mapping_class: foreground

I updated the test yaml spec file with:

#  test_img_dir: /data/images/val
#  test_ann_dir: /data/masks/val
  test_img_dir: /data/images/train
  test_ann_dir: /data/masks/train

Then ran:


!tao segformer evaluate \
                    -e $SPECS_DIR/test_isbi.yaml \
                    -k $KEY \
                    -g $NUM_GPUS \
                    model_path=$RESULTS_DIR/isbi_experiment/isbi_model.tlt \
                    output_dir=$RESULTS_DIR/isbi_experiment

This gives the same 0.0 performance for the foreground class.


+------------+-------+-------+
| Class      | IoU   | Acc   |
+------------+-------+-------+
| background | 99.91 | 100.0 |
| foreground | 0.0   | 0.0   |
+------------+-------+-------+
Summary:

+--------+-------+------+-------+
| Scope  | mIoU  | mAcc | aAcc  |
+--------+-------+------+-------+
| global | 49.96 | 50.0 | 99.91 |
+--------+-------+------+-------+
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: PASS
2023-02-20 23:38:38,615 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The complete evaluate log is here:

segformer evaluate 2023 02 20.txt (7.3 KB)

The training log is in the first post.

@Morganh Thanks!

From this log, I see that your yaml is different from the test_isbi.yaml in the notebook folder.
Can you refer to the test_isbi.yaml when running evaluation?

To be as close as possible to the isbi experiment, I changed the masks so that the value 0 is foreground and 255 is background.
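
Roughly, the remapping looked like this (a minimal sketch only; the paths, the PNG extension, and the in-place rewrite are assumptions rather than the exact script used):

# Sketch: swap 0 <-> 255 in every 8-bit grayscale mask so that foreground
# becomes 0 and background becomes 255 (paths and format are assumed).
import glob

import numpy as np
from PIL import Image

for split in ("train", "val"):
    for path in glob.glob(f"/data/masks/{split}/*.png"):
        mask = np.array(Image.open(path).convert("L"))
        Image.fromarray(255 - mask).save(path)  # 0 -> 255, 255 -> 0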

These are the train and test spec files where the test is pointing to the train images:

train_isbi.yaml (1.4 KB)
test_isbi.yaml (836 Bytes)

I then retrained the model. During training, the validation step twice produced the following:


2023-02-21 11:33:19,360 - mmseg - INFO - per class results:
2023-02-21 11:33:19,361 - mmseg - INFO - 
+------------+-------+-------+
| Class      | IoU   | Acc   |
+------------+-------+-------+
| foreground | 0.0   | 0.0   |
| background | 99.92 | 100.0 |
+------------+-------+-------+
2023-02-21 11:33:19,361 - mmseg - INFO - Summary:
2023-02-21 11:33:19,361 - mmseg - INFO - 
+--------+-------+------+-------+
| Scope  | mIoU  | mAcc | aAcc  |
+--------+-------+------+-------+
| global | 49.96 | 50.0 | 99.92 |
+--------+-------+------+-------+

And again at the end of training

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 78/78, 6.4 task/s, elapsed: 12s, ETA:     0s

2023-02-21 11:39:22,531 - mmseg - INFO - per class results:
2023-02-21 11:39:22,531 - mmseg - INFO - 
+------------+-------+-------+
| Class      | IoU   | Acc   |
+------------+-------+-------+
| foreground | 0.0   | 0.0   |
| background | 99.92 | 100.0 |
+------------+-------+-------+
2023-02-21 11:39:22,531 - mmseg - INFO - Summary:
2023-02-21 11:39:22,532 - mmseg - INFO - 
+--------+-------+------+-------+
| Scope  | mIoU  | mAcc | aAcc  |
+--------+-------+------+-------+
| global | 49.96 | 50.0 | 99.92 |
+--------+-------+------+-------+

complete training log 2023 02 21.txt (18.8 KB)

And when evaluating, I get the same 0.0 performance for foreground:

evaluate complete log 2023 02 21.txt (7.3 KB)

@Morganh Thanks!

Can you compare against the log from running the default isbi notebook to get some hints?

This is the training log from the original isbi case:

isbi full training log 2003 02 14.txt (18.8 KB)

Other than the image-size-related differences (1280 × 704 vs. 512 × 512), I don't see any anomaly:

          resize:
            img_scale:
            - 2048
            - 1280
        Pad:
          size_ht: 704
          size_wd: 1280
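
For what it's worth, my reading of the keep-ratio Resize (an assumption on my part, not verified against the TAO/mmseg source) is that img_scale (2048, 1280) would upscale the images roughly as follows:

# Sketch of mmcv-style keep-ratio rescaling with img_scale = (2048, 1280);
# this reflects my reading of the Resize transform and is an assumption.
def rescale_size(w, h, scale=(2048, 1280)):
    max_long_edge, max_short_edge = max(scale), min(scale)
    factor = min(max_long_edge / max(w, h), max_short_edge / min(w, h))
    return round(w * factor), round(h * factor)

print(rescale_size(1280, 704))  # roughly (2048, 1126) for my 1280 x 704 images
print(rescale_size(512, 512))   # roughly (1280, 1280) for the isbi images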

@Morganh Thanks!!!

In the isbi training log, the loss keeps decreasing.
Can you run more iterations? Also, is your image format the same as the isbi images?
You can also tune the learning rate.

@Morganh Thanks.

This is a dataset that I used to train a unet model, with the following results after 250 epochs:

"{'foreground': {'precision': 0.99967116, 'Recall': 0.99971056, 'F1 Score': 0.9996908303193671, 'iou': 0.9993819}, 
  'background': {'precision': 0.6252627, 'Recall': 0.5948752, 'F1 Score': 0.609690577910107, 'iou': 0.43852866}}"

I'll do more iterations and increase the learning rate.

A new problem arises:

As I am now investing substantial time in further training, I want to set things up so that training continues from a checkpoint, so I modified the training command to include --resume_training_checkpoint_path:

!tao segformer train \
                  -e $SPECS_DIR/train_isbi.yaml \
                  -r $RESULTS_DIR/isbi_experiment \
                  -k $KEY \
                  -g $NUM_GPUS \
                  --resume_training_checkpoint_path=$RESULTS_DIR/isbi_experiment 

This returns the error:

usage: train.py [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}]
                [--resolve] [--package PACKAGE] [--run] [--multirun]
                [--shell-completion] [--config-path CONFIG_PATH]
                [--config-name CONFIG_NAME] [--config-dir CONFIG_DIR]
                [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]]
                [overrides [overrides ...]]
train.py: error: unrecognized arguments: --resume_training_checkpoint_path=/results/isbi_experiment

I also tried setting this in the spec file:

  resume_training_checkpoint_path: /results/isbi_experiment

And the error here is

  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/checkpoint.py", 
line 333, in load_from_local    raise FileNotFoundError(f'{filename} can not be found.')

@Morganh thanks!

Dave

And sure enough, after 10,000 iterations I get some results, but they are not yet satisfactory.

Validation during training produces this last result at the end of 10,000 iterations:

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 78/78, 6.2 task/s, elapsed: 13s, ETA:     0s

2023-02-21 19:11:17,272 - mmseg - INFO - per class results:
2023-02-21 19:11:17,273 - mmseg - INFO - 
+------------+-------+-------+
| Class      | IoU   | Acc   |
+------------+-------+-------+
| foreground | 23.19 | 27.21 |
| background | 99.93 | 99.99 |
+------------+-------+-------+

But running tao evaluate on the test dataset gives a foreground IoU of essentially 0:

+------------+-------+-------+
| Class      | IoU   | Acc   |
+------------+-------+-------+
| foreground | 0.05  | 0.05  |
| background | 99.92 | 100.0 |
+------------+-------+-------+

I am still waiting for a solution to the checkpoint-continuation issue from the previous post.

@Morganh Thanks!

Can you set it as below?
resume_training_checkpoint_path=$RESULTS_DIR/isbi_experiment

Thanks!

!tao segformer train \
                  -e $SPECS_DIR/train_isbi.yaml \
                  -r $RESULTS_DIR/isbi_experiment \
                  -k $KEY \
                  -g $NUM_GPUS \
                  resume_training_checkpoint_path=$RESULTS_DIR/isbi_experiment

gives the error:

Could not override 'resume_training_checkpoint_path'.
To append to your config use +resume_training_checkpoint_path=/results/isbi_experiment
Key 'resume_training_checkpoint_path' not in 'SFTrainExpConfig'
    full_key: resume_training_checkpoint_path
    object_type=SFTrainExpConfig

I also tried

+resume_training_checkpoint_path=/results/isbi_experiment

and

+resume_training_checkpoint_path=$RESULTS_DIR/isbi_experiment

All of these result in errors.

Using = in the spec file is not understood either.

I updated the spec file to have:

  logging:
    interval: 200
  resume_training_checkpoint_path: $RESULTS_DIR/isbi_experiment
  runner:
    max_iters: 10100

The training log reports:

2023-02-22 09:40:45,144 - mmseg - INFO - load checkpoint from local path: $RESULTS_DIR/isbi_experiment

But it fails with this error:

  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/checkpoint.py", line 333, in load_from_local
    raise FileNotFoundError(f'{filename} can not be found.')
FileNotFoundError: $RESULTS_DIR/isbi_experiment can not be found.

I also tried in the spec file:

  resume_training_checkpoint_path: /results/isbi_experiment

which gives the same error:

FileNotFoundError: /results/isbi_experiment can not be found.

spec file (1.4 KB)

full train log.txt (14.5 KB)

@Morganh Thanks

The way to continue training from a checkpoint in segformer is to update the spec file with the full path to the checkpoint file:

  resume_training_checkpoint_path: /results/isbi_experiment/iter_10100.tlt
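
If the checkpoint iteration changes between runs, something like this can locate the newest checkpoint to paste into the spec (a sketch only; the iter_*.tlt naming is simply what appears in my results directory):

# Sketch: find the most recent iter_*.tlt checkpoint in the results directory
# (the path and naming are taken from my setup; adjust if yours differs).
import glob
import os

ckpts = glob.glob("/results/isbi_experiment/iter_*.tlt")
latest = max(ckpts, key=os.path.getmtime)
print(latest)  # use this full path for resume_training_checkpoint_path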

Talk about wasting time from bad documentation…

@Morganh I will train for more iterations to see whether I get good results.

Many thanks

@Morganh After 20,000 iterations, the validation result is:

+------------+-------+-------+
| Class      | IoU   | Acc   |
+------------+-------+-------+
| foreground | 31.29 | 36.52 |
| background | 99.93 | 99.99 |
+------------+-------+-------+

But running tao evaluate with a spec file containing

  test_img_dir: /data/images/val
  test_ann_dir: /data/masks/val

gives very poor performance compared to unet:

+------------+-------+-------+
| Class      | IoU   | Acc   |
+------------+-------+-------+
| foreground | 0.05  | 0.05  |
| background | 99.92 | 100.0 |
+------------+-------+-------+

And evaluating against the training images, using

#  test_img_dir: /data/images/val
#  test_ann_dir: /data/masks/val
  test_img_dir: /data/images/train
  test_ann_dir: /data/masks/train

I get

+------------+-------+-------+
| Class      | IoU   | Acc   |
+------------+-------+-------+
| foreground | 0.1   | 0.1   |
| background | 99.91 | 100.0 |
+------------+-------+-------+

Not sure where to go next to get reasonable segformer results.

Thanks!

I recall that you changed the pixel values of the two classes for the training masks. Did you make sure the test masks were also changed?

Sure!

But just to double-check, I ran:

!python vis_annotation_isbi.py -i $HOST_DATA_DIR/images/val -m $HOST_DATA_DIR/masks/val -o $HOST_RESULTS_DIR/isbi_experiment/vis_gt --num_classes 2 --num_images 10

And it comes out right (I sent you an image in private).