TX2 "INT8 not supported by platform. Trying FP16 mode"

Hello Morganh,
I am starting a new post, originally coming from here: https://devtalk.nvidia.com/default/topic/1064467/deepstream-sdk/resnet10-quot-primary-detector-quot-/post/5404002/#5404002

I did as you said and converted the engine with tlt-converter.

Running:
$ /usr/src/tensorrt/bin/trtexec --int8 --loadEngine= --calib= --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait

I get an average execution time of 20 ms. It doesn't sound very fast (-> 50 fps on 1 stream, 10 fps on 5 streams).
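
(Back-of-the-envelope: 1000 ms / 20 ms ≈ 50 inferences per second, so roughly 50 fps on a single stream, and with 5 streams sharing the same engine that drops to about 50 / 5 = 10 fps per stream, ignoring batching and pipeline overhead.)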

When I run the same engine with DeepStream, I get the following errors.

Creating LL OSD context new
0:00:01.349564544  9511   0x55a4c146c0 WARN                 nvinfer gstnvinfer.cpp:515:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:useEngineFile(): Failed to read from model engine file
0:00:01.349669440  9511   0x55a4c146c0 INFO                 nvinfer gstnvinfer.cpp:519:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:initialize(): Trying to create engine from model files
0:00:01.349941088  9511   0x55a4c146c0 WARN                 nvinfer gstnvinfer.cpp:515:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:generateTRTModel(): INT8 not supported by platform. Trying FP16 mode.
0:00:01.349987200  9511   0x55a4c146c0 ERROR                nvinfer gstnvinfer.cpp:511:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:generateTRTModel(): No model files specified
0:00:01.350042720  9511   0x55a4c146c0 ERROR                nvinfer gstnvinfer.cpp:511:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:initialize(): Failed to create engine from model files

Why does it say “INT8 not supported by platform. Trying FP16 mode.”?

The config of the primary GIE is the following:

[property]
    gpu-id=0
    net-scale-factor=0.0039215697906911373
    int8-calib-file=/detectnet_v2_resnet_10/calibration.bin
    labelfile-path=/detectnet_v2_resnet_10/classes.txt
    model-engine-file=engine/resnet10_int8.engine
    tlt-model-key=blablabla
    batch-size=5
    uff-input-blob-name=input_1
    uff-input-dims=3;608;608;0
    process-mode=1
    model-color-format=0
    network-mode=1
    num-detected-classes=2
    interval=0
    gie-unique-id=1
    output-blob-names=output_cov/Sigmoid;output_bbox/BiasAdd
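
(I assume the "No model files specified" error comes from nvinfer falling back to building an engine from model files after it fails to read model-engine-file, and the config above does not point to any model it could build from. Something along these lines would presumably be needed for that fallback, with the path being just a guess:

    # hypothetical path to the exported .etlt, decoded with tlt-model-key
    tlt-encoded-model=/detectnet_v2_resnet_10/resnet10_detector.etlt

But my main question is still the INT8 one.)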

Hi,

Which device are you using?
Please note that not all GPUs support INT8 operation.

The GPU architecture needs to be 7.x, or a P4.
Please check this page for the detailed information:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#hardware-precision-matrix
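
If you want to double-check directly on the board, and assuming the TensorRT Python bindings are installed, a quick (untested) way to query it is:

$ python3 -c "import tensorrt as trt; b = trt.Builder(trt.Logger()); print('fast INT8:', b.platform_has_fast_int8, 'fast FP16:', b.platform_has_fast_fp16)"

On devices without INT8 support this should print fast INT8: False.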

Thanks.

I use a Jetson TX2. But why does trtexec work fine?

$ /usr/src/tensorrt/bin/trtexec --int8 --loadEngine=resnet10_int8.engine --calib= --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait

Furthermore, the benchmark primary detector also loads an INT8 engine (“resnet10.caffemodel_b30_int8.engine”).

I am confused.

Hi rog07o4z,
Regarding the performance of the generated TRT engine, could you please share some more detailed info?

  1. Which Jetson platform: Nano, Xavier, or another one?
  2. What is the prune ratio? You can check it in the pruning log.
  3. What is the size of the pruned model?
  4. Can you share your full “tlt-converter” command? I want to check your batch size and TensorRT data type.

For the error when running the TRT engine with ds,
can you paste the full log?

  1. Jetson TX2

  2. Which pruning log??

  3. ResNet10: After pruning there are 19368 weights left.

  4. ./tlt-converter resnet10_detector.etlt -e resnet10_int8.engine -k MYKEY -c resnet10_calibration.bin -o output_cov/Sigmoid,output_bbox/BiasAdd -d 3,608,608 -b 8 -m 4 -t int8 -i nchw (an FP16 variant of this call is sketched after the logs below)

  5. ds full log:

./deepstream-test5-app -c test5_config_file_src_infer_custom_detectnet_resnet10.txt 

(deepstream-test5-app:10645): GLib-GObject-WARNING **: 10:04:36.161: g_object_set_is_valid_property: object class 'avenc_mpeg4' has no property named 'iframeinterval'

(deepstream-test5-app:10645): GLib-GObject-WARNING **: 10:04:36.161: g_object_set_is_valid_property: object class 'avenc_mpeg4' has no property named 'bufapi-version'
Creating LL OSD context new
0:00:01.361678272 10645   0x55c0de16c0 WARN                 nvinfer gstnvinfer.cpp:515:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:useEngineFile(): Failed to read from model engine file
0:00:01.361774144 10645   0x55c0de16c0 INFO                 nvinfer gstnvinfer.cpp:519:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:initialize(): Trying to create engine from model files
0:00:01.362070752 10645   0x55c0de16c0 WARN                 nvinfer gstnvinfer.cpp:515:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:generateTRTModel(): INT8 not supported by platform. Trying FP16 mode.
0:00:01.362138176 10645   0x55c0de16c0 ERROR                nvinfer gstnvinfer.cpp:511:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:generateTRTModel(): No model files specified
0:00:01.362192672 10645   0x55c0de16c0 ERROR                nvinfer gstnvinfer.cpp:511:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:initialize(): Failed to create engine from model files
0:00:01.362252320 10645   0x55c0de16c0 WARN                 nvinfer gstnvinfer.cpp:692:gst_nvinfer_start:<primary_gie_classifier> error: Failed to create NvDsInferContext instance
0:00:01.362286688 10645   0x55c0de16c0 WARN                 nvinfer gstnvinfer.cpp:692:gst_nvinfer_start:<primary_gie_classifier> error: Config file path: detectnet_v2_resnet_10.txt, NvDsInfer Error: NVDSINFER_CONFIG_FAILED

can't set pipeline to playing state.
Quitting
ERROR from primary_gie_classifier: Failed to create NvDsInferContext instance
Debug info: /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(692): gst_nvinfer_start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInfer:primary_gie_classifier:
Config file path: detectnet_v2_resnet_10.txt, NvDsInfer Error: NVDSINFER_CONFIG_FAILED
App run failed
  1. /usr/src/tensorrt/bin/trtexec --int8 --loadEngine=resnet10_int8.engine --calib=resnet10_calibration.bin --batch=1 --iterations=20 --output=output_cov/Sigmoid,output_bbox/BiasAdd --useSpinWait

LOG:

[I] int8
[I] loadEngine: resnet10_int8.engine
[I] calib: resnet10_calibration.bin
[I] batch: 1
[I] iterations: 20
[I] output: output_cov/Sigmoid,output_bbox/BiasAdd
[I] useSpinWait
[I] resnet10_int8.engine has been successfully loaded.
[I] Average over 10 runs is 191.889 ms (host walltime is 192.029 ms, 99% percentile time is 192.058).
[I] Average over 10 runs is 191.845 ms (host walltime is 191.889 ms, 99% percentile time is 192.084).
[I] Average over 10 runs is 191.94 ms (host walltime is 191.987 ms, 99% percentile time is 192.217).
[I] Average over 10 runs is 191.886 ms (host walltime is 191.933 ms, 99% percentile time is 192.066).
[I] Average over 10 runs is 191.846 ms (host walltime is 191.889 ms, 99% percentile time is 191.963).
[I] Average over 10 runs is 44.4582 ms (host walltime is 44.4947 ms, 99% percentile time is 191.925).
[I] Average over 10 runs is 19.5758 ms (host walltime is 19.6063 ms, 99% percentile time is 19.612).
[I] Average over 10 runs is 19.5645 ms (host walltime is 19.597 ms, 99% percentile time is 19.58).
[I] Average over 10 runs is 19.57 ms (host walltime is 19.6024 ms, 99% percentile time is 19.5988).
[I] Average over 10 runs is 19.5688 ms (host walltime is 19.6004 ms, 99% percentile time is 19.5856).
[I] Average over 10 runs is 19.5784 ms (host walltime is 19.6093 ms, 99% percentile time is 19.6333).
[I] Average over 10 runs is 19.5812 ms (host walltime is 19.6125 ms, 99% percentile time is 19.6102).
[I] Average over 10 runs is 19.5711 ms (host walltime is 19.602 ms, 99% percentile time is 19.6002).
[I] Average over 10 runs is 19.5767 ms (host walltime is 19.6081 ms, 99% percentile time is 19.6218).
[I] Average over 10 runs is 19.5617 ms (host walltime is 19.5921 ms, 99% percentile time is 19.5978).
[I] Average over 10 runs is 19.6818 ms (host walltime is 19.7295 ms, 99% percentile time is 20.1788).
[I] Average over 10 runs is 19.585 ms (host walltime is 19.6164 ms, 99% percentile time is 19.6313).
[I] Average over 10 runs is 19.5785 ms (host walltime is 19.6101 ms, 99% percentile time is 19.6181).
[I] Average over 10 runs is 19.5814 ms (host walltime is 19.6127 ms, 99% percentile time is 19.6403).
[I] Average over 10 runs is 19.5799 ms (host walltime is 19.6115 ms, 99% percentile time is 19.6614).
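
Given the “INT8 not supported by platform” warning, I suppose the engine should really be built in FP16 on the TX2. A variant of the tlt-converter call from point 4 above, with the calibration file dropped and the data type switched, would presumably be (not tried yet):

./tlt-converter resnet10_detector.etlt -e resnet10_fp16.engine -k MYKEY -o output_cov/Sigmoid,output_bbox/BiasAdd -d 3,608,608 -b 8 -m 4 -t fp16 -i nchw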

Please refer to that post for more information.
https://devtalk.nvidia.com/default/topic/1064467/deepstream-sdk/resnet10-quot-primary-detector-quot-/post/5404002/#5404002

Thanks

Thanks rog07o4z. The pruning log is written where you ran the “tlt-prune” command.
Also, please check the size (how many MB) of the pruned tlt model.
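
For example, assuming the prune output was saved to a file such as prune.log (the names here are just placeholders), something like this shows the ratio and the size:

$ grep "Pruning ratio" prune.log
$ ls -lh $USER_EXPERIMENT_DIR/experiment_dir_pruned/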

Pruning ratio (pruned model / original model): 1.0
Size of the pruned model:
total 19368
-rw-r--r-- 1 root root 19829544 Nov 22 09:23 resnet10_nopool_bn_detectnet_v2_pruned.tlt

Hi rog07o4z,
It seems that your trained model was not actually pruned, because the pruning ratio is 1.0.
What is your “-pth” value in the tlt-prune command? And did you run re-training against the pruned model?
The resnet10_nopool_bn_detectnet_v2_pruned.tlt you mentioned is a pruned tlt model which has not been retrained.

If you have retrained, could you also paste the size of the retrained model (i.e., resnet18_detector_pruned.tlt by default) and the size of the exported etlt model (resnet18_detector.etlt)?

I'm asking because a pruned model will get higher performance than an unpruned model.
To make the process clearer: unpruned model → prune → retrain → retrained model → exported etlt model

Hi,
Yes I understand the concept of pruning and I also retrained the model before applying it in the ds pipeline.

The prune command is:
!tlt-prune -pm $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet10_detector.tlt \
-o $USER_EXPERIMENT_DIR/experiment_dir_pruned/ \
-eq union \
-pth 0.0000052 \
-k $KEY

Size of the retrained model:
total 19368
-rw-r--r-- 1 root root 19829328 Nov 22 10:31 resnet10_detector_pruned.tlt

Size of the untrained model:
total 39268
-rw------- 1 root root 253 Nov 22 10:31 license.txt
-rw------- 1 root root 40205392 Nov 22 10:31 resnet10.hdf5

Since the number of weights decreased from 39268 to 19368, I expected that the pruning had worked fine.

Hi rog07o4z,
The resnet10.hdf5 is the pre-trained model; it is not related to the prune ratio. (Also note that the “total 39268” and “total 19368” figures are just block counts printed by ls, not weight counts; the relevant numbers are the file sizes, roughly 40 MB versus 19 MB.)
Your prune ratio is 1.0. That means the trained model has not actually been pruned.

Now I can see your whole process:

  1. You trained on your own data and got an unpruned model (size unknown).
  2. After pruning (prune ratio is 1.0), you got a 19 MB pruned model, resnet10_nopool_bn_detectnet_v2_pruned.tlt.
  3. After retraining, you got a 19 MB retrained model, resnet10_detector_pruned.tlt.
  4. Then you exported it as resnet10_detector.etlt (size unknown).

Regarding step 2, could you try to prune more aggressively? Refer to section 9 of the TLT documentation for the details.
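
For illustration, re-running the prune step with a larger threshold would look like the command below; -pth 0.01 is only an example starting value, and the right threshold depends on how much accuracy the retraining can recover.

!tlt-prune -pm $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet10_detector.tlt \
-o $USER_EXPERIMENT_DIR/experiment_dir_pruned/ \
-eq union \
-pth 0.01 \
-k $KEY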

Hi Morganh,
Just as feedback: increasing the -pth parameter helped immensely to get a pruning ratio < 1.
Thanks. Now I know where to tweak.