Pre-trained Segformer - CityScapes - Input dims appear to be 224x224

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) A6000
• DeepStream Version N/A
• JetPack Version (valid for Jetson only) N/A
• TensorRT Version 24.01 (Docker container)
• NVIDIA GPU Driver Version (valid for GPU only) 535.154.05
• Issue Type( questions, new requirements, bugs) Question
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing) Run triton on CityScapes models
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

Hi

I’ve been using the CitySemSeg .etlt model (1080x1920) with DeepStream in a C++ pipeline, and all is working fine. I noticed there are new models called Pre-trained Segformer - CityScapes with ONNX models (they appear to be annotated _224). I converted those to TensorRT and served them from Triton Inference Server. I’m using the Python client, but I had to adjust the input tensors to [3,224,224]; I originally had them at [3,1024,1024] as suggested by the model narrative. The reduction in input shape makes the output tensors poor in appearance - are there [3,1024,1024] versions?
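For reference, this is roughly how I’m calling the model from the Python client once the engine is served by Triton (a minimal sketch; the model name "segformer_cityscapes" and the URL are placeholders for my setup, and the tensor names "input"/"output" are what the converted engine reports):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Image already preprocessed/resized to CHW float32, 224x224, with a batch dim
img = np.random.rand(1, 3, 224, 224).astype(np.float32)

inp = httpclient.InferInput("input", list(img.shape), "FP32")
inp.set_data_from_numpy(img)
out = httpclient.InferRequestedOutput("output")

result = client.infer("segformer_cityscapes", inputs=[inp], outputs=[out])
mask = result.as_numpy("output")  # int32 class-index map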

Using DeepStream instead doesn’t make much difference, as the input shape is still [3,224,224].

Cheers

Are you talking about this? Pre-trained Segformer - CityScapes | NVIDIA NGC

Hi Fiona,

Yes, those models appear relatively new, and the narrative suggests that the input tensors are dimensioned as [3,1024,1024] whereas they are actually [3,224,224].

Cheers

@IainA
Which one did you use from Pre-trained Segformer - CityScapes | NVIDIA NGC?

Hi

All the deployable ones (ONNX). They all seem to take 224 input, and the file names all have _224 suffixes.

Cheers

In NGC, there are no [3,1024,1024] versions.
I will sync internally. Can you double-check the [3,224,224] deploy models with the tao-pytorch inference command?

Hi Morgan,

Triton Inference Server reports that [3,224,224] input is required when I pass a [3,1024,1024] image (as per the narrative), and when I pass a [3,224,224] image it handles it correctly.
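The expected dims can also be confirmed by querying the model metadata from the Python client (a sketch; the model name is a placeholder for my setup):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
meta = client.get_model_metadata("segformer_cityscapes")  # placeholder model name
print(meta["inputs"])   # tensor names, dtypes and shapes Triton expects
print(meta["outputs"])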

What is the difference between these models and the CitySemSegFormer models (citysemsegformer)?

For a [3,1024,1024] image, it can handle it correctly, right?
For a [3,224,224] image, it can also handle it correctly, right?

In CitySemSegformer | NVIDIA NGC, there are two kinds of backbones.
One is fan_base_16_p4_hybrid. The onnx input is 3x1024x1024.
The other is mit_b5.

In Pre-trained Segformer - CityScapes | NVIDIA NGC, the backbones are based on the FAN series. The input is 3x224x224.

Hi Morgan

Sorry for the delay and any confusion. As far as I can tell, all the deploy models expect input tensors of [3,224,224]. I could not get any of the deploy models to work with [3,1024,1024] images, so it looks like only the _224 models were uploaded. For the CitySemSegformer model from NGC, I can get any size of image to work.

Hope that clarifies.

Cheers

Got it. It makes sense.

Seems to be a model request. I will sync internally.

Hi @IainA
Please try downloading the model and exporting it to the ONNX file you expect.
For example, download Pre-trained Segformer - CityScapes | NVIDIA NGC, then run tao model export to generate an ONNX file.

segformer export -e /home/morganh/demo_3.0/forum_repro/segformer/spec.yaml export.checkpoint=cityscapes_fan_tiny_hybrid_224.pth export.onnx_file=1024_1024.onnx -r result

Spec file:

export:
  input_height: 1024
  input_width: 1024
  input_channel: 3
model:
  backbone:
    type: "fan_tiny_8_p4_hybrid"
dataset:
  img_norm_cfg:
      mean:
          - 123.675
          - 116.28
          - 103.53
      std:
          - 58.395
          - 57.12
          - 57.375
      to_rgb: true
  test_dataset:
      img_dir: /home/morganh/demo_2.0/unet/data/cityscapes/gtFine/train
      ann_dir: /home/morganh/demo_2.0/unet/data/cityscapes/gtFine/train
      pipeline:
        augmentation_config:
          resize:
            keep_ratio: True
  input_type: "rgb"
  data_root: /home/morganh/demo_2.0/unet/data/cityscapes/gtFine
  palette:
    - seg_class: road
      rgb:
        - 128
        - 64
        - 128
      label_id: 7
      mapping_class: road
    - seg_class: sidewalk
      rgb:
        - 244
        - 35
        - 232
      label_id: 8
      mapping_class: sidewalk
    - seg_class: building
      rgb:
        - 70
        - 70
        - 70
      label_id: 11
      mapping_class: building
    - seg_class: wall
      rgb:
        - 102
        - 102
        - 102
      label_id: 12
      mapping_class: wall
    - seg_class: fence
      rgb:
        - 190
        - 153
        - 153
      label_id: 13
      mapping_class: fence
    - seg_class: pole
      rgb:
        - 153
        - 153
        - 153
      label_id: 17
      mapping_class: pole
    - seg_class: traffic light
      rgb:
        - 250
        - 170
        - 30
      label_id: 19
      mapping_class: traffic light
    - seg_class: traffic sign
      rgb:
        - 220
        - 220
        - 0
      label_id: 20
      mapping_class: traffic sign
    - seg_class: vegetation
      rgb:
        - 107
        - 142
        - 35
      label_id: 21
      mapping_class: vegetation
    - seg_class: terrain
      rgb:
        - 152
        - 251
        - 152
      label_id: 22
      mapping_class: terrain
    - seg_class: sky
      rgb:
        - 70
        - 130
        - 180
      label_id: 23
      mapping_class: sky
    - seg_class: person
      rgb:
        - 220
        - 20
        - 60
      label_id: 24
      mapping_class: person
    - seg_class: rider
      rgb:
        - 255
        - 0
        - 0
      label_id: 25
      mapping_class: rider
    - seg_class: car
      rgb:
        - 0
        - 0
        - 142
      label_id: 26
      mapping_class: car
    - seg_class: truck
      rgb:
        - 0
        - 0
        - 70
      label_id: 27
      mapping_class: car
    - seg_class: bus
      rgb:
        - 0
        - 60
        - 100
      label_id: 28
      mapping_class: bus
    - seg_class: train
      rgb:
        - 0
        - 80
        - 100
      label_id: 31
      mapping_class: train
    - seg_class: motorcycle
      rgb:
        - 0
        - 0
        - 230
      label_id: 32
      mapping_class: motorcycle
    - seg_class: bicycle
      rgb:
        - 119
        - 11
        - 32
      label_id: 33
      mapping_class: bicycle
  workers_per_gpu: 1
  batch_size: -1

Thanks @Morganh - I will try it and report back with results.
Cheers

Hi @Morganh

I created the ONNX file and used tao deploy (rather than trtexec) to produce the TRT model.plan, using your spec file with modifications to point to my filesystem/container.

When I run polygraphy inspect model model.plan (from within a TRT container - tensorrt:24.01-py3) I get the following output:

[I] Loading bytes from /trt_optimize/model.plan
[I] ==== TensorRT Engine ====
Name: Unnamed Network 0 | Explicit Batch Engine

---- 1 Engine Input(s) ----
{input [dtype=float32, shape=(-1, 3, 1024, 1024)]}

---- 1 Engine Output(s) ----
{output [dtype=int32, shape=(1, -1, 1024, 1024)]}

---- Memory ----
Device Memory: 21979332608 bytes

---- 1 Profile(s) (2 Tensor(s) Each) ----
- Profile: 0
    Tensor: input           (Input), Index: 0 | Shapes: min=(1, 3, 1024, 1024), opt=(8, 3, 1024, 1024), max=(8, 3, 1024, 1024)
    Tensor: output         (Output), Index: 1 | Shape: (1, -1, 1024, 1024)

I’m assuming that the output dimension of -1 (dynamic) means the output channels are configured at runtime, and that it will output 3 channels when the input has 3 channels?

I’m also assuming I can get the segmentation colour for each class from the spec file you provided above?
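Something like this is what I had in mind for building the colour lookup from that palette (a sketch assuming the network’s output class indices follow the order of the palette entries, i.e. 0 = road, 1 = sidewalk, ...):

import numpy as np
import yaml

with open("spec.yaml") as f:                       # the spec file quoted above
    spec = yaml.safe_load(f)

palette = spec["dataset"]["palette"]
lut = np.zeros((len(palette), 3), dtype=np.uint8)  # class index -> RGB
for idx, entry in enumerate(palette):
    lut[idx] = entry["rgb"]

def colorize(mask):
    # mask: int32 array of shape (H, W) holding class indices from the engine output
    return lut[mask]                               # (H, W, 3) uint8 RGB image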

Thank you for all your help with this.

Cheers

Could you use trtexec to generate the TensorRT engine to double-check? Refer to TRTEXEC with Segformer - NVIDIA Docs.

Hi @Morganh
I used the trtexec that is exposed in the tao deploy container using:

!tao deploy segformer run trtexec --onnx=$SPECS_DIR/1024_1024.onnx \
    --maxShapes=input:16x3x1024x1024 \
    --minShapes=input:1x3x1024x1024 \
    --optShapes=input:8x3x1024x1024 \
    --fp16 \
    --saveEngine=$SPECS_DIR/model.plan

The 1024_1024.onnx file was generated using the spec file you provided above. I then used polygraphy by using:

!tao deploy segformer run polygraphy inspect model $SPECS_DIR/model.plan

And got the following output:

[I] Loading bytes from /workspace/tao-experiments/specs/model.plan
[I] ==== TensorRT Engine ====
Name: Unnamed Network 0 | Explicit Batch Engine

---- 1 Engine Input(s) ----
{input [dtype=float32, shape=(-1, 3, 1024, 1024)]}

---- 1 Engine Output(s) ----
{output [dtype=int32, shape=(1, -1, 1024, 1024)]}

---- Memory ----
Device Memory: 14773387264 bytes

---- 1 Profile(s) (2 Tensor(s) Each) ----
- Profile: 0
    Tensor: input           (Input), Index: 0 | Shapes: min=(1, 3, 1024, 1024), opt=(8, 3, 1024, 1024), max=(16, 3, 1024, 1024)
    Tensor: output         (Output), Index: 1 | Shape: (1, -1, 1024, 1024)
---- 254 Layer(s) ----

So I expected the dynamic dimension (-1) to cope with a 3-channel input tensor by producing a 3-channel output tensor, but it appears to give a grayscale output (i.e. 1x1024x1024).

Any further thoughts? Thank you.

I will try on my side as well. Thanks for the info.


Hi @IainA
I ran the steps below.

$ /usr/src/tensorrt/bin/trtexec --onnx=citysemsegformer_fan.onnx --minShapes=input:1x3x1024x1024 --optShapes=input:1x3x1024x1024 --maxShapes=input:1x3x1024x1024 --saveEngine=fp32.engine

The result is as below.

# polygraphy inspect model fp32.engine
[I] Loading bytes from /home/morganh/demo_3.0/forum_repro/segformer/fp32.engine
[I] ==== TensorRT Engine ====
    Name: Unnamed Network 0 | Explicit Batch Engine

    ---- 1 Engine Input(s) ----
    {input [dtype=float32, shape=(1, 3, 1024, 1024)]}

    ---- 1 Engine Output(s) ----
    {output [dtype=int32, shape=(1, 1, 1024, 1024)]}

    ---- Memory ----
    Device Memory: 2192719872 bytes

    ---- 1 Profile(s) (2 Binding(s) Each) ----
    - Profile: 0
        Binding Index: 0 (Input)  [Name: input]  | Shapes: min=(1, 3, 1024, 1024), opt=(1, 3, 1024, 1024), max=(1, 3, 1024, 1024)
        Binding Index: 1 (Output) [Name: output] | Shape: (1, 1, 1024, 1024)

    ---- 341 Layer(s) ----

If I set different shapes:

$ /usr/src/tensorrt/bin/trtexec --onnx=citysemsegformer_fan.onnx --minShapes=input:2x3x1024x1024 --optShapes=input:3x3x1024x1024 --maxShapes=input:4x3x1024x1024 --saveEngine=fp32_dynamic.engine

$polygraphy inspect model fp32_dynamic.engine
[I] Loading bytes from /home/morganh/demo_3.0/forum_repro/segformer/fp32_dynamic.engine
[I] ==== TensorRT Engine ====
    Name: Unnamed Network 0 | Explicit Batch Engine

    ---- 1 Engine Input(s) ----
    {input [dtype=float32, shape=(-1, 3, 1024, 1024)]}

    ---- 1 Engine Output(s) ----
    {output [dtype=int32, shape=(1, -1, 1024, 1024)]}

    ---- Memory ----
    Device Memory: 9035645440 bytes

    ---- 1 Profile(s) (2 Binding(s) Each) ----
    - Profile: 0
        Binding Index: 0 (Input)  [Name: input]  | Shapes: min=(2, 3, 1024, 1024), opt=(3, 3, 1024, 1024), max=(4, 3, 1024, 1024)
        Binding Index: 1 (Output) [Name: output] | Shape: (1, -1, 1024, 1024)

    ---- 343 Layer(s) ----

So, the -1 is related to the batch size.
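In other words, following the note above that the -1 corresponds to the batch size, each slice along that dimension of the dynamic engine’s output is one [1024,1024] int32 class-index map, one per image in the batch. A minimal sketch of reading it out (the output array here is just a placeholder for the actual engine/Triton output buffer):

import numpy as np

output = np.zeros((1, 4, 1024, 1024), dtype=np.int32)  # placeholder: engine output for a batch of 4

class_maps = output[0]                 # shape (N, 1024, 1024), one class-index map per image
for i, class_map in enumerate(class_maps):
    print(i, class_map.shape, np.unique(class_map)[:5])

Each class map can then be coloured with the palette lookup sketched earlier in the thread.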
