UffParser: Validator error: block_4c_bn_3/cond/Switch: Unsupported operation _Switch

• Hardware (RTX 3090)
• Network Type (Mask_rcnn)
• TLT Version (toolkit_version 3.21.11, format_version 2.0, published 11/08/2021; dockers: nvidia/tao/tao-toolkit-tf, nvidia/tao/tao-toolkit-pyt, nvidia/tao/tao-toolkit-lm)
• Training spec file: maskrcnn_retrain_resnet50.txt (2.0 KB)
• How to reproduce the issue –

So I trained a Mask R-CNN model with a ResNet-50 backbone on a custom dataset and successfully pruned it.

After retraining the pruned model, I tried to export it to FP32/FP16 format.
Here is the command I used for exporting:

tao mask_rcnn export -m /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_pruned_p70/model.step-40000.tlt -k nvidia_tlt -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt --batch_size 1 --engine_file /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_pruned_p70/export/model.step-40000.engine

With this command I am getting the following error:

[TensorRT] ERROR: UffParser: Validator error: block_4c_bn_3/cond/Switch: Unsupported operation _Switch
2021-12-13 13:15:00,471 [ERROR] iva.common.export.trt_utils: Failed to parse UFF File

Here is the error log:


...
Warning: No conversion function registered for layer: GenerateDetection_TRT yet.
Converting generate_detections as custom op: GenerateDetection_TRT
Warning: No conversion function registered for layer: MultilevelProposeROI_TRT yet.
Converting multilevel_propose_rois as custom op: MultilevelProposeROI_TRT
Warning: No conversion function registered for layer: MultilevelCropAndResize_TRT yet.
Converting pyramid_crop_and_resize_box as custom op: MultilevelCropAndResize_TRT
DEBUG [/usr/local/lib/python3.6/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['generate_detections', 'mask_fcn_logits/BiasAdd'] as outputs
2021-12-13 13:14:59,450 [INFO] iva.mask_rcnn.export.exporter: Converted model was saved into /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_pruned_p70/model.step-40000.etlt
[TensorRT] ERROR: UffParser: Validator error: block_4c_bn_3/cond/Switch: Unsupported operation _Switch
2021-12-13 13:15:00,471 [ERROR] iva.common.export.trt_utils: Failed to parse UFF File
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/trt_utils.py", line 301, in _load_from_files
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/trt_utils.py", line 301, in _load_from_files
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/export.py", line 12, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 265, in launch_export
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 247, in run_export
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/export/exporter.py", line 654, in export
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/trt_utils.py", line 291, in __init__
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/trt_utils.py", line 164, in __init__
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/trt_utils.py", line 309, in _load_from_files
AssertionError: UFF parsing failed on line 301 in statement 
2021-12-13 18:45:02,008 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I am not sure why this is happening. Shouldn't a trained TAO model be ready to export by default?
Why would we get this error with the default model itself?
The dataset used for training is private.
Please see the attached config file for the parameters used for training.

May I know if you can export the unpruned model successfully?
From your description, after pruning, you did not run retraining, right?
Please run retraining and then export the retrained model.

What I meant was that I retrained the pruned model and then exported it.

As shown in the command, I exported the 40k-iteration checkpoint of the retrained pruned model.

Also,

The .etlt was exported successfully, as shown in the logs, but the .engine file could not be generated because of that error.

Thank you for the quick reply.

To narrow this down, could you please export the unpruned model as well?

tao mask_rcnn export -m /workspace/tao-experiments/mask_rcnn/experiments/unpruned.tlt -k nvidia_tlt -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt --batch_size 1 --engine_file /workspace/tao-experiments/mask_rcnn/experiments/out.engine

So I ran this command –

tao mask_rcnn export -m /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_unpruned/model.step-40000.tlt -k nvidia_tlt -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt --batch_size 1 --data_type fp32 --engine_file /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_unpruned/model.step-40000.engine

This is the config file:

maskrcnn_train_resnet50.txt (2.0 KB)

Same error. This is weird. I set freeze_bn: False and freeze_blocks: "[]"; can this cause the issue? From what I understand, _Switch is related to BN.

Here is the error trace:

Warning: No conversion function registered for layer: ResizeNearest_TRT yet.
Converting nearest_upsampling_1 as custom op: ResizeNearest_TRT
Warning: No conversion function registered for layer: ResizeNearest_TRT yet.
Converting nearest_upsampling as custom op: ResizeNearest_TRT
Warning: No conversion function registered for layer: SpecialSlice_TRT yet.
Converting mrcnn_detection_bboxes as custom op: SpecialSlice_TRT
Warning: No conversion function registered for layer: GenerateDetection_TRT yet.
Converting generate_detections as custom op: GenerateDetection_TRT
Warning: No conversion function registered for layer: MultilevelProposeROI_TRT yet.
Converting multilevel_propose_rois as custom op: MultilevelProposeROI_TRT
Warning: No conversion function registered for layer: MultilevelCropAndResize_TRT yet.
Converting pyramid_crop_and_resize_box as custom op: MultilevelCropAndResize_TRT
DEBUG [/usr/local/lib/python3.6/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['generate_detections', 'mask_fcn_logits/BiasAdd'] as outputs
2021-12-14 07:05:46,813 [INFO] iva.mask_rcnn.export.exporter: Converted model was saved into /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_unpruned/model.step-40000.etlt
[TensorRT] ERROR: UffParser: Validator error: block_4c_bn_3/cond/Switch: Unsupported operation _Switch
2021-12-14 07:05:47,218 [ERROR] iva.common.export.trt_utils: Failed to parse UFF File
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/trt_utils.py", line 301, in _load_from_files
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/trt_utils.py", line 301, in _load_from_files
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/export.py", line 12, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 265, in launch_export
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py", line 247, in run_export
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/export/exporter.py", line 654, in export
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/trt_utils.py", line 291, in __init__
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/trt_utils.py", line 164, in __init__
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/trt_utils.py", line 309, in _load_from_files
AssertionError: UFF parsing failed on line 301 in statement 
2021-12-14 12:35:48,658 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Should I share the original pre-trained model I used (resnet50.hdf5)?
I installed TAO almost a week ago. As is evident from the environment, everything is the latest and up to date. The driver version is 495.44.

PS: I renamed the pruned model model.tlt to modelp70.tlt and specified the same path in the retrain spec file. Hopefully this doesn't have any impact.

Where did you download the pre-trained model? Could you share the link?

I downloaded it almost a month ago, but I used this link:

ngc registry model download-version nvidia/tao/pretrained_instance_segmentation:resnet50 --dest $LOCAL_EXPERIMENT_DIR/pretrained_resnet50

BTW, did you ever train with the officially released Jupyter notebook? Was it successful?

Please share the tao info via the command below.
$ tao info --verbose

This is the most curious part to me; here is the timeline:

First, I trained a ResNet-18 on a very small dataset with the Jupyter notebook on my local machine, and it trained successfully. It exported successfully as well.

After that, I moved to a fully equipped RTX 3090 remote machine and trained a ResNet-50 on the full dataset, using the exact same commands, just run in the terminal instead of the notebook, since it was a remote machine and tmux is a lot easier to use there. After training, for some reason it simply failed to export, as shown in the error.

Output:

                                                                                                                                                    
dockers:                                                                                                                                                                
        nvidia/tao/tao-toolkit-tf:                                                                                                                                      
                v3.21.11-tf1.15.5-py3:                                                                                                                                  
                        docker_registry: nvcr.io                                                                                                                        
                        tasks:                                                                                                                                          
                                1. augment                                                                                                                              
                                2. bpnet                                                                                                                                
                                3. classification                                                                                                                       
                                4. dssd                                                                                                                                 
                                5. emotionnet                                                                                                                           
                                6. efficientdet                                                                                                                         
                                7. fpenet                                                                                                                               
                                8. gazenet                                                                                                                              
                                9. gesturenet                                                                                                                           
                                10. heartratenet                                                                                                                        
                                11. lprnet                                                                                                                              
                                12. mask_rcnn
                                13. multitask_classification
                                14. retinanet
                                15. ssd
                                16. unet
                                17. yolo_v3
                                18. yolo_v4
                                19. yolo_v4_tiny
                                20. converter
                v3.21.11-tf1.15.4-py3: 
                        docker_registry: nvcr.io
                        tasks: 
                                1. detectnet_v2
                                2. faster_rcnn
        nvidia/tao/tao-toolkit-pyt: 
                v3.21.11-py3: 
                        docker_registry: nvcr.io
                        tasks: 
                                1. speech_to_text
                                2. speech_to_text_citrinet
                                3. text_classification
                                4. question_answering
                                5. token_classification
                                6. intent_slot_classification
                                7. punctuation_and_capitalization
                                8. spectro_gen
                                9. vocoder 
                                10. action_recognition
        nvidia/tao/tao-toolkit-lm: 
                v3.21.08-py3: 
                        docker_registry: nvcr.io
                        tasks: 
                                1. n_gram

format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

For your reference, I bashed into the TAO docker and got the TensorRT version:

8.0.1-1+cuda11.3
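
In case it is useful, a minimal way to read that from inside the container (just a sketch, assuming the TensorRT Python bindings are installed there):

# Hedged sketch: print the TensorRT Python binding version from inside the container.
import tensorrt
print(tensorrt.__version__)  # e.g. 8.0.1.x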

Thank you, let me know if you find anything that could lead to this.

Hey, a quick important update from my end.

I tried again from scratch: I downloaded the ResNet-50 model using the ngc command mentioned above and used the same config file.
Then I trained the unpruned model for only 4-5k iterations.

After that, I tried to export that unpruned model, and it gave me the same error. How can that be?

Could you refer to the spec file in the Jupyter notebook and retry? Is there any difference between your spec and the spec inside the Jupyter notebook?

Also, I suggest you follow the Jupyter notebook's spec and use the KITTI dataset, running in the terminal of the fully equipped RTX 3090 remote machine, to check whether that is successful as well.

Yes, the differences between the new system spec and the notebook spec are as follows:

  1. The notebook used ResNet-18; this is ResNet-50.
  2. freeze_bn: True and freeze_blocks: "[0,1]" were in the default notebook spec file;
    I changed those to freeze_bn: False and freeze_blocks: "[]".

I deleted that spec file from my local machine, as it was just for testing purposes.
I made the same changes on the RTX 3090 system, and that is when this happened.

As mentioned, I tried running a totally fresh unpruned training for 4-5k iterations, and the same _Switch error came up while exporting to TensorRT.

Hey, a quick update on this. I made the following changes to the spec file:

freeze_bn: True
freeze_blocks: [0,1]

from

freeze_bn: False
freeze_blocks: "[]"

After retraining, it exported successfully this time. I think this happened due to the freeze_bn: False parameter. But why is that?
I don't want to freeze BN, as this can have an impact on training accuracy. Can you please explain? That would be great. BN helps prevent overfitting, and I don't feel that freezing it will help at all.
PS: My data is totally different in terms of features from COCO/ImageNet, so I want to train the full model and not freeze anything.
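
For what it's worth, my current understanding (an assumption on my side, not something confirmed above): a non-frozen BatchNorm layer in a TF1 graph is built with tf.cond on the learning phase, and tf.cond inserts Switch/Merge nodes into the graph; the UFF parser does not support Switch, while freeze_bn: True keeps BN on a single inference path with no conditional. A minimal sketch of where the cond/Switch nodes come from (plain TF 1.15, not the TAO exporter itself; names are illustrative only):

import tensorflow as tf

# A boolean placeholder stands in for the Keras learning phase.
is_training = tf.compat.v1.placeholder(tf.bool, name="learning_phase")
x = tf.constant([1.0, 2.0, 3.0])

# A non-frozen BatchNorm wraps its train/inference branches in tf.cond like this;
# tf.cond adds Switch (and Merge) ops to the graph.
y = tf.cond(is_training, lambda: x * 0.9, lambda: x)

graph = tf.compat.v1.get_default_graph()
print([op.name for op in graph.get_operations() if op.type == "Switch"])
# e.g. ['cond/Switch', ...] -- the same kind of node the UFF parser reports as "_Switch"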

Thanks for your quick update! Could you please trigger another experiment as below?

freeze_bn: True
freeze_blocks: [ ]

freeze_bn: Whether to freeze all BatchNorm layers in the backbone
freeze_blocks: A list of conv blocks in the backbone to freeze

It works with this setting; the problem is due to freeze_bn: False. This doesn't sound good to me. Doesn't freezing BN during training have a negative impact? Why is this happening? We can't train a network with frozen layers from scratch on a custom dataset.

Also, I have 1248 validation images, so I set eval_sample: 1248 in the config.
During validation, the log outputs something like:

[MaskRCNN] INFO    : Running inference on batch 623/624... -                Step Time: 0.0670s - Throughput: 29.9 imgs/s
[MaskRCNN] INFO    : Running inference on batch 624/624... -                Step Time: 0.0667s - Throughput: 30.0 imgs/s
[MaskRCNN] INFO    : Loading and preparing results...
[MaskRCNN] INFO    : 0/124800
[MaskRCNN] INFO    : 1000/124800

This is 100x more than my eval samples, and this process is also dead slow; it takes a lot of time, more than the training itself. Why is that?

It is creating prediction results. The "100" comes from the "test_detections_per_image" parameter.
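
For a rough sense of scale (plain arithmetic only, assuming the count in the log is simply eval images x test_detections_per_image, which 1248 x 100 = 124800 suggests):

# Hedged arithmetic sketch: how the candidate count scales with test_detections_per_image.
eval_samples = 1248               # validation images reported above
for dets_per_image in (100, 10):  # current value vs. a hypothetical lower value
    print(dets_per_image, eval_samples * dets_per_image)
# 100 -> 124800 (matches the log), 10 -> 12480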

We’ll check further.

Will it have a negative impact on the training/validation metric results if I set it to something like 1 or 10? It would reduce my training time by a lot. PS: I don't want to compromise accuracy for the speed gain, though.

Please do. I am training with freeze_bn: True as of now and will report if I see any drop in metrics as soon as training completes.

It is not suggested to set it too low. This parameter is the number of bounding box candidates after NMS. As for training time, you can actually reduce the evaluation frequency; please set a larger "num_steps_per_eval".

That is exactly what I have done. That said, in my dataset there is no image with more than 10 instances/bboxes; in the vast majority of the data it is only 1-2 bboxes per image.

Thank you for the clarity on this.