DeepStream can’t create the .engine from the .etlt of a custom Mask R-CNN model, using TLT 3.0 [2]

Please provide complete information as applicable to your setup.

This is a continuation of a previous post that I could not keep attending to, and which was closed.
It won’t happen again.

• Hardware Platform (Jetson / GPU)
GPU: GTX 3060
• DeepStream Version
5.1
• TensorRT Version
7.2.3
• NVIDIA GPU Driver Version (valid for GPU only)
460.73.01
• Issue Type( questions, new requirements, bugs)
question

I am working in the Docker container for DeepStream 5.1. I CANNOT upgrade the version, so please do not suggest that.

I trained a Mask R-CNN model with TLT on a custom dataset, and now I need to run that model in DeepStream.
Since that version of DeepStream doesn’t support Mask R-CNN, I compiled a parser following [these instructions](Deploying to DeepStream for MaskRCNN - NVIDIA Docs).

It compiled with no problems, and I changed my config_infer file to the following.

[property]
gpu-id=0
net-scale-factor=1.0
offsets=103.939;116.779;123.68
model-color-format=1
labelfile-path=<labels path>/labels.txt
tlt-encoded-model=<model path>/model.step-25000.etlt
#model-engine-file= <once it is generated I will add it here>
tlt-model-key=<secret key>
uff-input-dims=3;1024;1920;0
uff-input-blob-name=Input
batch-size=1
#network-mode=2
num-detected-classes=5
interval=0
gie-unique-id=1
is-classifier=0

## parser
output-blob-names=generate_detections;mask_fcn_logits/BiasAdd
cluster-mode=4
network-type=3 ## 3 is for instance segmentation network
output-instance-mask=1
parse-bbox-instance-mask-func-name=NvDsInferParseCustomMrcnnTLT
custom-lib-path=/tmp/deepstream_tlt_apps/post_processor/libnvds_infercustomparser_tlt.so

[class-attrs-all]
pre-cluster-threshold=0.6
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=35
detected-min-h=35
#detected-max-w=1000
detected-max-h=850

My problem is that DeepStream is not able to generate the engine file; it fails with the following error:

[NvDCF] Initialized
0:00:00.272590254  5144 0x55df4f6f0f90 INFO                 nvinfer gstnvinfer.cpp:619:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1716> [UID = 1]: Trying to create engine from model files
ERROR: ../nvdsinfer/nvdsinfer_func_utils.cpp:33 [TRT]: UffParser: Output error: Output mask_fcn_logits/BiasAdd not found
parseModel: Failed to parse UFF model
ERROR: tlt/tlt_decode.cpp:274 failed to build network since parsing model errors.
ERROR: ../nvdsinfer/nvdsinfer_model_builder.cpp:797 Failed to create network using custom network creation function
ERROR: ../nvdsinfer/nvdsinfer_model_builder.cpp:862 Failed to get cuda engine from custom library API
0:00:01.537341226  5144 0x55df4f6f0f90 ERROR                nvinfer gstnvinfer.cpp:613:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1736> [UID = 1]: build engine file failed
corrupted size vs. prev_size
Aborted (core dumped)

One thing I noticed is that the guide says:
parse-bbox-instance-mask-func-name=NvDsInferParseCustomMrcnnTLT

but in the source code of that repository, that function does not exist. Instead, there is NvDsInferParseCustomMrcnnTLTV2.

So, the questions are:
Which tag of the repository should I be using? (The instructions say to run git clone -b release/tlt3.0 on the NVIDIA-AI-IOT/deepstream_tao_apps repository on GitHub.)
Or, which tutorial should I be following?

thank you.

@fanzh replied last time; the suggestion was “please correct output-blob-names in the configuration file.”
@fanzh, which name should I place there? How do I know which name should be placed there?
Checking the functions in nvdsinfer_custombboxparser_tlt.cpp, I saw this:

bool NvDsInferParseCustomMrcnnTLTV2 (std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
                                   NvDsInferNetworkInfo  const &networkInfo,
                                   NvDsInferParseDetectionParams const &detectionParams,
                                   std::vector<NvDsInferInstanceMaskInfo> &objectList) {
    // Look up an output layer by name among the layers the model exposes.
    auto layerFinder = [&outputLayersInfo](const std::string &name)
        -> const NvDsInferLayerInfo *{
        for (auto &layer : outputLayersInfo) {
            if (layer.dataType == FLOAT &&
              (layer.layerName && name == layer.layerName)) {
                return &layer;
            }
        }
        return nullptr;
    };

    // Both names must exist as output layers of the model; they are the same
    // names given in output-blob-names in the nvinfer config.
    const NvDsInferLayerInfo *detectionLayer = layerFinder("generate_detections");
    const NvDsInferLayerInfo *maskLayer = layerFinder("mask_fcn_logits/BiasAdd");

With that, I am not sure whether it is the name of a function in the parser or the name of a layer in the model.
If it is the latter, does that mean I lost a layer? Is that possible?

I tried adding std::cout << "Layer name: " << layer.layerName << std::endl; just before the if in that code, but I hit the error before that line ever executes.
Checking nvdsinfer_context_impl and nvdsinfer_model_builder, I see that the builder is initialized with the path of nvdsinfer_custombboxparser_tlt.so, so I would assume that the function NvDsInferParseCustomMrcnnTLT (or NvDsInferParseCustomMrcnnTLTV2) should be executed first.
Am I wrong? Am I missing something?

Thank you!

output-blob-names represents an “array of output layer names”; please look it up in the nvinfer documentation.

Thank you.
It seems I understood the code right, then. I’ll have to look further into the docs.
There is a detail I need help with.
As stated, I trained the network following the Mask R-CNN example and changed nothing but the dataset and the parameters strictly necessary to train on a custom dataset, so I don’t see how my model could lack that output layer. I also don’t know where the layer names are set, nor where I can check them so I can place the right one in the config file.
Where do I find that info?
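
For reference, one way to inspect what a .hdf5 file actually stores is to walk it with h5py. This is only a sketch under assumptions: it supposes the pretrained resnet50.hdf5 is a standard HDF5 weights file and that h5py is installed; the printed dataset paths are the stored weight/layer names.

# Sketch: print every dataset path stored in a Keras/TF-style .hdf5 file.
# Assumption: resnet50.hdf5 is a regular HDF5 weights file (not encrypted).
import h5py

def print_dataset(name, obj):
    # visititems calls this for every group and dataset in the file
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape)

with h5py.File("resnet50.hdf5", "r") as f:
    f.visititems(print_dataset)

Note that this can only show what is stored in the backbone weights; output node names such as generate_detections and mask_fcn_logits/BiasAdd are created when the graph is built and exported, so they may not appear in the weights file at all.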
Thank you again for your support.

That “generate_detections;mask_fcn_logits/BiasAdd” is for peoplesegnet_resnet50.etlt. Is your model based on peoplesegnet_resnet50.etlt?

I used the TLT notebook example, but the pretrained model used there is not available anymore.
I checked the models in your repository and ended up downloading this one, since it was the only one available for download:

https://api.ngc.nvidia.com/v2/models/nvidia/tao/pretrained_instance_segmentation/versions/resnet50/files/resnet50.hdf5

Can you provide me with the peoplesegnet_resnet50.etlt, or the parser necessary for the model I used, please?

thank you

Please refer to deepstream_tao_apps/download_models.sh at master in NVIDIA-AI-IOT/deepstream_tao_apps on GitHub.

I tried https://api.ngc.nvidia.com/v2/models/nvidia/tao/peoplesegnet/versions/deployable_v2.0.2 from the repository, but as the name says, it is an .etlt and cannot be trained on.

For multi-GPU, change --gpus based on your machine.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

Using TensorFlow backend.
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 355, in train_and_eval
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 266, in get_training_hooks
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 248, in load_pretrained_model
ValueError: Pretrained weights in only .hdf or .tlt format are supported.

[MaskRCNN] ERROR   : Job finished with an uncaught exception: `FAILURE`
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.

Then I downloaded the three trainable versions of peoplesegnet according to
https://api.ngc.nvidia.com/v2/models/nvidia/tao/peoplesegnet/versions

with this command

!ngc registry model download-version nvidia/tao/peoplesegnet:trainable_v1.0 --dest $USER_EXPERIMENT_DIR/pretrained_resnet50
!ngc registry model download-version nvidia/tao/peoplesegnet:trainable_v2.0 --dest $USER_EXPERIMENT_DIR/pretrained_resnet50
!ngc registry model download-version nvidia/tao/peoplesegnet:trainable_v2.1 --dest $USER_EXPERIMENT_DIR/pretrained_resnet50

but I got a similar output for all three of them.

For multi-GPU, change --gpus based on your machine.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

Using TensorFlow backend.
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 236, in load_pretrained_model
  File "/usr/lib/python3.6/zipfile.py", line 1131, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.6/zipfile.py", line 1198, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 196, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 192, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 91, in run_executer
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 355, in train_and_eval
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 266, in get_training_hooks
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 244, in load_pretrained_model
OSError: The last checkpoint file is not saved properly.                     Please delete it and rerun the script.

[MaskRCNN] ERROR   : Job finished with an uncaught exception: `FAILURE`
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.

I used the same config file and the same tfrecords; I only changed the path in the checkpoint field.

I find it curious, since the output from training on the .etlt says I should use a .hdf or .tlt model, so I assumed those would work.

This leaves me with the following questions:
Why does the exception ask for a .zip?
As I understand it, .hdf (or .hdf5) is a compressed container format; does the error have something to do with that? (See the format check sketched below.)
Am I missing something?
Do you have the .hdf so I can try that one?
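
As a quick check on the format question, the container type of a file can be read off its first bytes. A minimal sketch; model.tlt here is a placeholder path for whichever file fails to load:

# Sketch: inspect a file's magic bytes to see what container it actually is.
# "model.tlt" is a placeholder path; substitute the file that fails to load.
with open("model.tlt", "rb") as f:
    magic = f.read(8)
print(magic.hex())
# HDF5 files begin with 89 48 44 46 0d 0a 1a 0a ("\x89HDF\r\n\x1a\n");
# zip archives begin with 50 4b 03 04 ("PK\x03\x04").

A BadZipFile error like the one above means the loader tried to open the file with Python's zipfile module and did not find the "PK" signature, so the file is not a plain zip archive.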

Hi @Morganh ,
Can you help comment above issue?

@ai12
For the mask_rcnn pretrained model, please use TAO Pretrained Instance Segmentation on NVIDIA NGC.

@fanzh, @yingliu, @Morganh Thank you all for your help, I really appreciate it.

@Morganh, unfortunately, those models are the same ones I used the first time. TLT trained on them with no issue, but the parser fails; it seems to be looking, unsuccessfully, for the layer mask_fcn_logits/BiasAdd.

Is there any particular model among those that should work in DeepStream with the parser?

thank you.

It seems that you are using an old version of TLT. So, if you are running inference with DeepStream, please use the old-version branch of the GitHub repository.

Yes, I am using an old version of both TLT and DeepStream.
The two versions came together: TLT 3.0 and DeepStream 5.1.

On the parser repository I used the old branch too, release/tlt3.0.
I checked the prerequisite plugins as well. All of them were introduced in an earlier release than TensorRT OSS 7.2.3, which is the one I have installed.

Can you please address these questions from the posts above?

  • How can I see the layer names from a model (.hdf5 and .tlt)?
  • Do you have the peoplesegnet.hdf5?
  • Do the models in pretrained_instance_segmentation have the layer mask_fcn_logits/BiasAdd? How can I check whether a model actually has that layer? (See the sketch after this list.)
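
On the last point: once an engine can be built for the model (for example with tlt-converter), its actual binding names can be listed with the TensorRT Python API. A minimal sketch, assuming the TensorRT 7.x Python bindings are installed and model.engine is a placeholder path:

# Sketch: print every binding (input/output layer) of a serialized TRT engine.
# Assumptions: TensorRT 7.x Python API; "model.engine" is a placeholder path.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(kind, engine.get_binding_name(i), engine.get_binding_shape(i))

If the exported model really lacks mask_fcn_logits/BiasAdd, it will not show up among the output bindings.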

thank you

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Transferring to the TAO Toolkit forum.
You can check the log when running training.

No, there is no peoplesegnet.hdf5.

Please see deepstream_tao_apps/pgie_peopleSegNetv2_tlt_config.txt at release/tlt3.0 in NVIDIA-AI-IOT/deepstream_tao_apps on GitHub, and run with the release/tlt3.0 branch of NVIDIA-AI-IOT/deepstream_tao_apps.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.