Training Segformer+Cradiov2 with multiple classes not working (only learning 1 class)

I want to train the cradio+segformer model on more than a single class now. I have prepared my labels as 8-bit PNG masks where the pixel values (0, 1, 2, etc.) correspond to the label_id in dataset.segment.palette, and I have set label_transform: None. I have also provided a palette for each label_id as follows:

    palette:
    - label_id: 0
      mapping_class: background
      rgb:
      - 0
      - 0
      - 0
      seg_class: background
    - label_id: 1
      mapping_class: Seal
      rgb:
      - 255
      - 0
      - 0
      seg_class: Seal
    - label_id: 2
      mapping_class: FM
      rgb:
      - 0
      - 255
      - 0
      seg_class: FM

Is this setup correct? What I'm observing is that, again, only the first class (background) is being learned; the other classes all stay at 0 accuracy and loss. I'm wondering whether this is because the palette RGB for background is all zeros.

I’ve also attached the full experiment YAML for a trial run with 2 classes that exhibits the same issue:

full_experiment_yaml.txt (3.7 KB)

Hi, @kianmehr.ehtiatkar2
Please set num_classes to the number of classes excluding the background. The example spec YAML contains 6 classes plus one unknown class (which should be the background).

Hi @Morganh , I’m setting the number of classes to exclude the background, but I’m also including background in the palette with label_id=255. Inspecting the TensorBoard events, I see that only the first label is being learned, and even that not correctly: the metrics for iou_0 and F1_0 are changing, while iou_1 and F1_1 stay at 0. Inference produces all-black masks, which would be all background (which isn’t even included in the labels). From the model architecture printed during training, I can tell the decoder has an output dimension of 2, so it matches my 2 foreground classes; I’m not sure where the disconnect is coming from.
To reiterate, my input images are RGB, and my masks are single-channel pixelwise images where the pixel integers correspond to the label_id of the classes included.
Example combined visualization from inference attached.

I’ve also included the tensorboard trends.

evaluate output logs:

I see this user is also having the same issue:

Hi @kianmehr.ehtiatkar2 ,
Every pixel in the mask must have an integer value that represents the segmentation class label_id. But I find that some mask files do not have the correct pixel values.
Please double check the mask files. Thanks!

from PIL import Image
import numpy as np

#img = Image.open('20241215 231539 - E1_4898_crop.png').convert('L')
img = Image.open('20250228 105515 - E22_4895_cropped_x852_y47.png').convert('L')
#img = Image.open('20241026 073531 - E1_5199_cropped_x1986_y0.png').convert('L')
#img = Image.open('20241117 232345 - E1_4885_cropped_x0_y0.png').convert('L')
#img = Image.open('20240506 004517 - E12_4912_cropped_x852_y0.png').convert('L')

img_np = np.array(img)
unique_values = np.unique(img_np)

print(f'totally there are {len(unique_values)} kinds of pixels: {unique_values}')
print('----------------------')
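The same check can be run over a whole folder instead of one file at a time. A sketch, assuming single-channel PNG masks; the `masks/train` path and the expected value set {0, 1, 255} are placeholders to adjust to your own layout and label_ids:

```python
# Batch version of the pixel-value check: scan every mask in a folder and
# flag files whose pixel values fall outside the expected label_ids.
import glob
import os

import numpy as np
from PIL import Image

expected = {0, 1, 255}       # label_ids plus the 255 background (assumption)
mask_dir = "masks/train"     # adjust to your dataset layout (assumption)

for path in sorted(glob.glob(os.path.join(mask_dir, "*.png"))):
    values = set(np.unique(np.array(Image.open(path).convert("L"))).tolist())
    unexpected = values - expected
    if unexpected:
        print(f"{os.path.basename(path)}: unexpected pixel values {sorted(unexpected)}")
```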

Hi @Morganh I purged my dataset of background-only images and masks and trained again with the same behavior.

I’m noticing that only the first label or label_id=0 is being learned per the combined visualization image attached. I’ve also attached the mask file for this image for reference showing 2 objects with integers 0 and 1 and background at 255. Only integer 0 is being learned, and the model performance is also reflecting this.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│           F1_0            │    0.9999999403953552     │
│           F1_1            │            0.0            │
│            acc            │            1.0            │
│           iou_0           │            1.0            │
│           iou_1           │            0.0            │
│            mf1            │    0.4999999701976776     │
│           miou            │            0.5            │
│        mprecision         │            0.5            │
│          mrecall          │            0.5            │
│        precision_0        │            1.0            │
│        precision_1        │            0.0            │
│         recall_0          │            1.0            │
│         recall_1          │            0.0            │
└───────────────────────────┴───────────────────────────┘

Does this help you troubleshoot the issue further?

Combined vis:

Original mask for the image containing integers of 0, 1, and 255:

Hi @kianmehr.ehtiatkar2 ,
Firstly, make sure each mask file has the correct pixel value for each class.
For example, one class has pixel-value=0 and another class has pixel-value=1.

Successful case1:
Then, if you are going to use a palette, please change the 1-channel mask PNG files to 3-channel mask PNG files. Also, set label_transform: None.

# cat change_1_channle_to_3_channel_green.py
# pip install pillow numpy
import os, glob
import numpy as np
from PIL import Image

#in_dir  = "xxx/data/masks/train"     # 1-channel 0/1 mask folder
#out_dir = "xxx/data/masks_3channel/train"      # output RGB mask folder
in_dir  = "xxx/data/masks/val"     # 1-channel 0/1 mask folder
out_dir = "xxx/data/masks_3channel/val"      # output RGB mask folder
os.makedirs(out_dir, exist_ok=True)

for p in glob.glob(os.path.join(in_dir, "*.png")):
    g = np.array(Image.open(p))                   # 8-bit 1-channel 
    assert g.ndim == 2, f"Not single channel: {p}"
    rgb = np.zeros((g.shape[0], g.shape[1], 3), dtype=np.uint8)
    rgb[g == 1] = (0, 255, 0)                     # set to green color
    Image.fromarray(rgb, mode="RGB").save(
        os.path.join(out_dir, os.path.basename(p)), format="PNG"
    )

Then, set below in the spec file.

    num_classes: 2
    img_size: 224
    #label_transform: "norm"
    label_transform: None
    palette:
    - label_id: 0
      mapping_class: background
      rgb:
      - 0
      - 0
      - 0
      seg_class: background
    - label_id: 1
      mapping_class: tab
      rgb:
      - 0
      - 255
      - 0
      seg_class: tab
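The same conversion idea generalizes to more than one foreground class by mapping each label_id to its palette color. A sketch: the id→color dictionary below is an assumption and must mirror the rgb entries in dataset.segment.palette.

```python
# Generalized 1-channel -> 3-channel conversion for several classes.
# ID_TO_RGB is an assumption; keep it in sync with the palette in the spec.
import numpy as np

ID_TO_RGB = {
    0: (0, 0, 0),      # background -> black
    1: (0, 255, 0),    # class 1    -> green
    2: (255, 0, 0),    # class 2    -> red
}

def to_rgb_mask(gray: np.ndarray) -> np.ndarray:
    assert gray.ndim == 2, "expected a single-channel mask"
    rgb = np.zeros((*gray.shape, 3), dtype=np.uint8)
    for label_id, color in ID_TO_RGB.items():
        rgb[gray == label_id] = color  # paint each label_id its palette color
    return rgb
```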

Successful case2:
If you are not using palette, you can still use the 1-channel mask png file. Set label_transform: "norm".

    num_classes: 2
    img_size: 224
    label_transform: "norm"

@Morgan Huang, thank you for the response. I ran a quick experiment with the approach of removing the palette, and I saw meaningful metrics, so thank you!

  1. If I have multiple classes, how do I extract each class from the segmentation produced at inference? The produced masks are binary black-and-white and don’t differentiate between the classes. Do I need to specify the palette for that?
  2. Also, a question on the 3-channel masks: do I keep the same formatting for the labels? In other words, do I still use a pixel integer per class, repeated across the 3 channels: class_1 as (0, 0, 0), class_2 as (1, 1, 1), background as (255, 255, 255), etc.?
  3. Lastly, can I use this pattern for a binary task as well? My intention is to avoid maintaining two preprocessing pipelines for segformer training. In other words, masks with 2 integer values (0 and 255) and num_classes=1.

As we synced offline, if you want to include the background and use 3-channel RGB PNG masks with a palette, then:

dataset:
  segment:
    dataset: "SFDataset"
    root_dir: /path/to/dataset_rgb
    num_classes: 3            # background + 2 foreground classes
    img_size: 224
    train_split: "train"
    validation_split: "val"
    test_split: "val"
    predict_split: "test"
    label_transform: None     # palette uses 0–255 range; disable normalization on labels
    palette:
      - label_id: 0
        mapping_class: background
        rgb: [0, 0, 0]
        seg_class: background
      - label_id: 1
        mapping_class: class_1
        rgb: [0, 255, 0]
        seg_class: class_1
      - label_id: 2
        mapping_class: class_2
        rgb: [255, 0, 0]
        seg_class: class_2
    # masks must contain ONLY these exact RGB colors; labels will be mapped to indices {0,1,2} accordingly
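Given the "ONLY these exact RGB colors" requirement, a quick sanity check over the RGB masks can catch off-palette pixels before training. A sketch; the color set mirrors the palette above, and the folder path is an assumption:

```python
# Verify every pixel in each RGB mask is one of the palette colors.
# PALETTE and the folder path are assumptions; keep PALETTE in sync
# with the rgb entries in the spec.
import glob
import os

import numpy as np
from PIL import Image

PALETTE = {(0, 0, 0), (0, 255, 0), (255, 0, 0)}

def off_palette_colors(rgb: np.ndarray) -> set:
    flat = rgb.reshape(-1, 3)
    colors = {tuple(int(c) for c in row) for row in np.unique(flat, axis=0)}
    return colors - PALETTE

for path in glob.glob(os.path.join("dataset_rgb/masks/train", "*.png")):
    bad = off_palette_colors(np.array(Image.open(path).convert("RGB")))
    if bad:
        print(f"{os.path.basename(path)}: off-palette colors {sorted(bad)}")
```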

I already tried all of the above, and it seems that no matter the num_classes, if I don’t include a palette and set label_transform to norm, all foreground classes are combined into a single foreground class. Also, when I set background to 255, my val_loss metric starts reporting only NaN. The only combination that produced proper metrics was including everything, background class included, plus 3-channel masks.
Here are the charts for reference. It also doesn’t seem like the labels are loaded properly: despite 3-channel RGB masks with palettes (picture attached with black, red, and green), the combined visualization shows black-and-white masks.