No speedup after TensorRT INT8 (PointNet++ TensorFlow model)

Hi,
I use an i7 CPU + GTX 1660 Ti GPU.
Software info:
TensorRT 6
TensorFlow 1.15.0
CUDA 10.1
cuDNN 7.6.3
Ubuntu 16 (Docker)

I converted my model to a TensorRT engine using TF-TRT. The conversion log is shown below:

2020-02-18 04:06:40.806897: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:633] Number of TensorRT candidate segments: 76
2020-02-18 04:06:41.444418: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 10 nodes succeeded.
2020-02-18 04:06:41.444539: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node TRTEngineOp_1 added for segment 1 consisting of 19 nodes succeeded.
2020-02-18 04:06:41.444757: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node TRTEngineOp_2 added for segment 2 consisting of 17 nodes succeeded.
2020-02-18 04:06:41.444921: I tensorflow/compiler/tf2tensorrt/convert/convert_graph.cc:734] TensorRT node TRTEngineOp_3 added for segment 3 consisting of 17 nodes succeeded.

2020-02-18 04:06:42.007572: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:786] Optimization results for grappler item: TRTEngineOp_24_native_segment
2020-02-18 04:06:42.007581: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:788] constant_folding: Graph size after: 20 nodes (0), 19 edges (0), time = 0.919ms.
2020-02-18 04:06:42.007587: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:788] layout: Graph size after: 20 nodes (0), 19 edges (0), time = 0.652ms.
2020-02-18 04:06:42.007593: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:788] constant_folding: Graph size after: 20 nodes (0), 19 edges (0), time = 0.765ms.
2020-02-18 04:06:42.007599: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:788] TensorRTOptimizer: Graph size after: 20 nodes (0), 19 edges (0), time = 0.095ms.
2020-02-18 04:06:42.007605: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:788] constant_folding: Graph size after: 20 nodes (0), 19 edges (0), time = 0.828ms.

graph_size(MB)(native_tf): 52.6
graph_size(MB)(trt): 52.8
num_nodes(native_tf): 4243
num_nodes(tftrt_total): 3429
num_nodes(trt_only): 76

Calibrating INT8…
2020-02-18 04:06:43.967599: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e44009490
2020-02-18 04:06:43.967669: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-02-18 04:06:43.968091: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-02-18 04:06:58.017542: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-18 04:06:58.058014: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e3c008780
2020-02-18 04:06:58.079357: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e74007ed0
2020-02-18 04:06:58.152381: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e6c01f3a0
2020-02-18 04:06:58.188232: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-18 04:06:58.188977: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e5006f210
2020-02-18 04:06:58.258622: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e6c0279a0
2020-02-18 04:06:58.301596: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e6c027860
2020-02-18 04:06:58.382948: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e6c03b280
2020-02-18 04:06:58.432091: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e6c03fea0
2020-02-18 04:06:58.467576: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e5007d720

2020-02-18 04:07:12.796185: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e3c0b45c0
2020-02-18 04:07:12.836554: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e3c0bbc70
2020-02-18 04:07:13.362552: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e3c0c3520
2020-02-18 04:07:14.132879: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e3c0c48f0
2020-02-18 04:07:14.132989: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e4c001180
2020-02-18 04:07:14.331160: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:812] Starting calibration thread on device 0, Calibration Resource @ 0x7f5e3c0e3660

However, when I test it, the engine model runs at the same speed as the original TensorFlow .pb model.
The inference log is shown below:

2020-02-18 04:10:37.066778: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:733] Building a new TensorRT engine for TRTEngineOp_0 input shapes: [[1,8192,3]]
2020-02-18 04:10:37.066912: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-02-18 04:10:37.067393: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-02-18 04:10:55.899338: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:733] Building a new TensorRT engine for fp_2/TRTEngineOp_14 input shapes: [[1,256,3]]
2020-02-18 04:10:55.916907: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:733] Building a new TensorRT engine for fp_1/TRTEngineOp_10 input shapes: [[1,1024,3]]
2020-02-18 04:10:55.988548: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:733] Building a new TensorRT engine for fp_0/TRTEngineOp_6 input shapes: [[1,8192,3]]
2020-02-18 04:10:56.048204: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-18 04:10:56.049951: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:733] Building a new TensorRT engine for TRTEngineOp_57 input shapes: [[1,64,8192,4]]
2020-02-18 04:10:56.470749: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:733] Building a new TensorRT engine for TRTEngineOp_58 input shapes: [[1,64,8192,2]]
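
As a side note on measurement: TF-TRT builds its engines lazily, which is why the "Building a new TensorRT engine" lines above appear at inference time, so the first iterations must be excluded from any timing. A minimal timing sketch (the inference callable here is a stand-in, not my actual session code):

```python
import time

def benchmark(infer_fn, n_warmup=10, n_runs=100):
    """Average latency of infer_fn, excluding warm-up iterations.

    TF-TRT builds TensorRT engines lazily on the first calls, so those
    calls must not be included in the measurement.
    """
    for _ in range(n_warmup):
        infer_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer_fn()
    return (time.perf_counter() - start) / n_runs

# Stand-in workload; replace with e.g. a sess.run(...) closure.
avg_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print("avg latency: %.3f ms" % (avg_s * 1e3))
```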

By the way, I also tested FP32, FP16, and INT8; all of them run at the same speed. I think the model converts successfully, but there is no speedup.
Is the problem in my model or in TensorRT?
Could you please give me some ideas?
I appreciate your help in advance.

Hi,

Can you check the number of nodes that are being optimized or replaced with TRT nodes?
Please refer to the link below for the sample code:
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tensorrt-plan

# Dump each TRTEngineOp's serialized engine to a .plan file and list
# which nodes were (and were not) converted.
import tensorflow as tf

trt_graph = getTrtGraphDef()  # your function returning the converted GraphDef
for n in trt_graph.node:
  if n.op == "TRTEngineOp":
    print("Node: %s, %s" % (n.op, n.name.replace("/", "_")))
    with tf.gfile.GFile("%s.plan" % (n.name.replace("/", "_")), 'wb') as f:
      f.write(n.attr["serialized_segment"].s)
  else:
    print("Exclude Node: %s, %s" % (n.op, n.name.replace("/", "_")))

The default value of “minimum_segment_size” is 3; try reducing “minimum_segment_size” to 2.
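
For reference, a sketch of where that knob goes in the TF 1.15 conversion API; the keyword-argument names below follow the TF-TRT user guide, while the input graph and output-node names are placeholders:

```python
# Conversion settings; minimum_segment_size controls how many TF ops a
# subgraph must contain before it is grouped into a TRTEngineOp.
conversion_kwargs = dict(
    precision_mode="INT8",
    minimum_segment_size=2,   # default is 3
    is_dynamic_op=True,
    use_calibration=True,
)

# With TensorFlow 1.15 installed, this would be used roughly as:
#   from tensorflow.python.compiler.tensorrt import trt_convert as trt
#   converter = trt.TrtGraphConverter(
#       input_graph_def=frozen_graph,      # placeholder
#       nodes_blacklist=["output_node"],   # placeholder output names
#       **conversion_kwargs)
#   trt_graph = converter.convert()
print(conversion_kwargs["minimum_segment_size"])  # → 2
```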

Thanks

Actually, I already print this info in my log using:

print("graph_size(MB)(native_tf): %.1f" % (float(graph_size) / (1 << 20)))
print("graph_size(MB)(trt): %.1f" %
     (float(len(engine_graph.SerializeToString())) / (1 << 20)))
print("num_nodes(native_tf): %d" % num_nodes)
print("num_nodes(tftrt_total): %d" % len(engine_graph.node))
print("num_nodes(trt_only): %d" % len([1 for n in engine_graph.node if str(n.op) == 'TRTEngineOp']))

graph_size(MB)(native_tf): 52.6
graph_size(MB)(trt): 52.8
num_nodes(native_tf): 4243
num_nodes(tftrt_total): 3429
num_nodes(trt_only): 76

I also added your code after the convert() function.

The log is as follows:
Node: TRTEngineOp, fp_0_TRTEngineOp_6
Node: TRTEngineOp, fp_1_TRTEngineOp_10
Node: TRTEngineOp, fp_2_TRTEngineOp_14
Node: TRTEngineOp, TRTEngineOp_57
Node: TRTEngineOp, TRTEngineOp_58
Node: TRTEngineOp, TRTEngineOp_18
Node: TRTEngineOp, TRTEngineOp_59
Node: TRTEngineOp, TRTEngineOp_60
Node: TRTEngineOp, TRTEngineOp_28
Node: TRTEngineOp, TRTEngineOp_73
Node: TRTEngineOp, sa_0_TRTEngineOp_29
Node: TRTEngineOp, TRTEngineOp_19
Node: TRTEngineOp, TRTEngineOp_61
Node: TRTEngineOp, TRTEngineOp_62
Node: TRTEngineOp, TRTEngineOp_20
Node: TRTEngineOp, TRTEngineOp_63
Node: TRTEngineOp, TRTEngineOp_64
Node: TRTEngineOp, ps_res_1_TRTEngineOp_21
Node: TRTEngineOp, TRTEngineOp_30
Node: TRTEngineOp, TRTEngineOp_74
Node: TRTEngineOp, sa_1_TRTEngineOp_31
Node: TRTEngineOp, TRTEngineOp_22
Node: TRTEngineOp, TRTEngineOp_65
Node: TRTEngineOp, TRTEngineOp_66
Node: TRTEngineOp, TRTEngineOp_23
Node: TRTEngineOp, TRTEngineOp_67
Node: TRTEngineOp, TRTEngineOp_68
Node: TRTEngineOp, TRTEngineOp_24
Node: TRTEngineOp, TRTEngineOp_25
Node: TRTEngineOp, TRTEngineOp_69
Node: TRTEngineOp, TRTEngineOp_70
Node: TRTEngineOp, TRTEngineOp_26
Node: TRTEngineOp, TRTEngineOp_71
Node: TRTEngineOp, TRTEngineOp_72
Node: TRTEngineOp, TRTEngineOp_1
Node: TRTEngineOp, TRTEngineOp_32
Node: TRTEngineOp, TRTEngineOp_75
Node: TRTEngineOp, sa_2_TRTEngineOp_33
Node: TRTEngineOp, TRTEngineOp_13
Node: TRTEngineOp, TRTEngineOp_47
Node: TRTEngineOp, TRTEngineOp_15
Node: TRTEngineOp, TRTEngineOp_48
Node: TRTEngineOp, TRTEngineOp_49
Node: TRTEngineOp, TRTEngineOp_50
Node: TRTEngineOp, TRTEngineOp_16
Node: TRTEngineOp, TRTEngineOp_51
Node: TRTEngineOp, TRTEngineOp_52
Node: TRTEngineOp, TRTEngineOp_53
Node: TRTEngineOp, TRTEngineOp_17
Node: TRTEngineOp, TRTEngineOp_54
Node: TRTEngineOp, TRTEngineOp_55
Node: TRTEngineOp, TRTEngineOp_56
Node: TRTEngineOp, TRTEngineOp_2
Node: TRTEngineOp, TRTEngineOp_9
Node: TRTEngineOp, TRTEngineOp_40
Node: TRTEngineOp, TRTEngineOp_11
Node: TRTEngineOp, TRTEngineOp_41
Node: TRTEngineOp, TRTEngineOp_42
Node: TRTEngineOp, TRTEngineOp_43
Node: TRTEngineOp, TRTEngineOp_12
Node: TRTEngineOp, TRTEngineOp_44
Node: TRTEngineOp, TRTEngineOp_45
Node: TRTEngineOp, TRTEngineOp_46
Node: TRTEngineOp, TRTEngineOp_3
Node: TRTEngineOp, TRTEngineOp_5
Node: TRTEngineOp, TRTEngineOp_35
Node: TRTEngineOp, TRTEngineOp_36
Node: TRTEngineOp, TRTEngineOp_7
Node: TRTEngineOp, TRTEngineOp_37
Node: TRTEngineOp, TRTEngineOp_38
Node: TRTEngineOp, TRTEngineOp_39
Node: TRTEngineOp, TRTEngineOp_8
Node: TRTEngineOp, TRTEngineOp_34
Node: TRTEngineOp, TRTEngineOp_27
Node: TRTEngineOp, euler_to_matrix_TRTEngineOp_4

I already tried different minimum_segment_size values (2, 3, 5); the result is the same.

Hi,

Setting the precision requests that TensorRT use a layer implementation whose inputs and outputs match the preferred types. By default, TensorRT will choose such an implementation only if it results in a higher-performance network; if an implementation at a higher precision is faster, TensorRT will use that instead.

Also, the node counts indicate that almost 80% of the nodes are still executed as TensorFlow nodes. The converted nodes are grouped into 76 TRT segments, which means switching between TF and TRT 76 times per inference. Each switch has some overhead, and that can eat away the performance gain you get from TRT.
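
To put rough numbers on that (illustrative only, not measured): by Amdahl's law, if only a fraction of the runtime is spent inside TRT segments, the overall speedup is capped no matter how fast those segments become.

```python
def overall_speedup(trt_fraction, trt_speedup):
    """Amdahl's-law bound: only trt_fraction of the runtime is
    accelerated, and that part runs trt_speedup times faster."""
    return 1.0 / ((1.0 - trt_fraction) + trt_fraction / trt_speedup)

# If TRT covered ~20% of the runtime and doubled its speed, the whole
# model would only get ~1.11x faster.
print(round(overall_speedup(0.2, 2.0), 2))  # → 1.11
```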

Thanks

Thanks for your reply firstly!

I tried different minimum_segment_size values such as 3, 5, 10, 15, 20…
As you said, when there are fewer TRT nodes, the FPS gets a little better.
But the FPS is still not what I expected; in my understanding, INT8 should be almost twice as fast as FP32.

So, is there any way to make my model faster with TensorRT? Or can TRT do nothing for my model right now?

I appreciate your help in advance.

Additionally, if I set minimum_segment_size to 2, TRT crashes as follows:

tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: Resource TF-TRT/TRTEngineOp_38/N10tensorflow8tensorrt22TRTEngineCacheResourceE does not exist.
[[{{node GetCalibrationDataOp}}]]
(1) Not found: Resource TF-TRT/TRTEngineOp_38/N10tensorflow8tensorrt22TRTEngineCacheResourceE does not exist.
[[{{node GetCalibrationDataOp}}]]
[[GetCalibrationDataOp/_45]]
0 successful operations.
0 derived errors ignored.

I use the PointNet++ model:
https://github.com/charlesq34/pointnet2

Does anyone have any ideas why the model cannot be sped up by TensorRT?
How should I change the model so that I can use TensorRT effectively?

Hi,

Other than the suggestions above, the best way to get more performance is probably to have a larger portion of the model (or the entire model) supported by TensorRT, so that more of it gets optimized implementations and there is less switching with TF.

A simple approach might be to try using TF-TRT with TensorRT 7 (nvcr.io/nvidia/tensorflow:20.01-tf1-py3, since you mentioned Docker) and see whether any more ops are supported in TRT 7 than in TRT 6.

You could also try to convert your model to ONNX with tf2onnx and then convert ONNX->TRT with TensorRT 7, though I'm not sure this is likely to work, since such a low percentage of the model above was converted with TF-TRT. It is still worth a try if you have the time, as a full TensorRT engine will likely be the most performant.