I also measured FP16 and INT8, since those engines can be built as well.
FP16
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model_fp16.engine --fp16
python trt_classificator_av.py --model=model_fp16.engine --image=testimage.jpg
load: 2.4407081604003906 sec
1st frame: 0.09991908073425293 sec
infer: 0.0901947021484375 sec
[array([[0.01608747, 0.01584241, 0.01523666, ..., 0.0150394 , 0.01474018,
0.01657081],
[0.01589403, 0.01591903, 0.01520303, ..., 0.01500689, 0.01453697,
0.01645847],
[0.01578404, 0.01565488, 0.01522088, ..., 0.01504286, 0.0145905 ,
0.01607842],
...,
[0.0162069 , 0.01540244, 0.01553817, ..., 0.01534508, 0.01526381,
0.01682989],
[0.01589699, 0.01576799, 0.01550861, ..., 0.01529452, 0.01531881,
0.01666016],
[0.0156943 , 0.01575795, 0.01554175, ..., 0.01541713, 0.01540702,
0.01633761]], dtype=float32), array([[-1.0683594 , -5.40625 , 2.8320312 , -0.20666504],
[-0.11517334, -5.2265625 , 3.1503906 , 0.02355957],
[ 0.42407227, -4.6054688 , 2.4003906 , 0.32495117],
...,
[-1.1621094 , -3.9375 , 3.5996094 , 0.4230957 ],
[-0.87597656, -3.203125 , 2.375 , 0.79541016],
[-1.0263672 , -3.1738281 , 1.8730469 , 0.6777344 ]],
dtype=float32)]
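For context, the load / 1st frame / infer times come from the attached trt_classificator_av.py. The exact script may differ, but a minimal sketch of that kind of flow looks like this (assuming the TensorRT 8.x Python API with pycuda, and that binding 0 is the input). The 1st frame is typically slower because the first execution triggers lazy CUDA/cuDNN initialization:

```python
import time

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

t0 = time.time()
with open("model_fp16.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# One pinned host buffer and one device buffer per binding.
buffers, bindings = [], []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(i)), dtype)
    device = cuda.mem_alloc(host.nbytes)
    buffers.append((host, device))
    bindings.append(int(device))
print("load:", time.time() - t0, "sec")

def infer(image):
    # Copy the input to the device, run synchronously, copy the outputs back.
    np.copyto(buffers[0][0], image.ravel())
    cuda.memcpy_htod(buffers[0][1], buffers[0][0])
    context.execute_v2(bindings)
    outs = []
    for host, device in buffers[1:]:
        cuda.memcpy_dtoh(host, device)
        outs.append(host.copy())
    return outs
```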
INT8
Create an INT8 engine from the frozen_graph.pb of a custom TensorFlow 2.x model.
I used the val2017 images for calibration.
As much as possible, calibrate with the same kind of data you will use at inference time.
# download val2017
wget http://images.cocodataset.org/zips/val2017.zip
unzip -qq val2017.zip
Your model does not have a batch_size dimension, so you will need to rewrite the export code. The scripts I modified are:
/home/jetson/github/TensorRT/samples/python/tensorflow_object_detection_api/image_batcher.py
/home/jetson/github/TensorRT/samples/python/tensorflow_object_detection_api/build_engine.py
image_batcher.py (8.3 KB)
build_engine.py (11.8 KB)
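build_engine.py drives INT8 calibration through an entropy calibrator (the TRT-8201-EntropyCalibration2 cache in the log below) that feeds preprocessed batches to TensorRT and caches the resulting scales. Here is a minimal sketch of such a calibrator; the class and constructor parameters are mine, and the real sample wraps its ImageBatcher instead:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt


class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed batches to TensorRT and caches the computed scales."""

    def __init__(self, batches, batch_size, input_bytes, cache_file):
        super().__init__()
        self.batches = batches          # iterator of float32 (N,C,H,W) arrays
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.d_input = cuda.mem_alloc(input_bytes)  # device buffer for one batch

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                 # tells TensorRT that calibration is done
        cuda.memcpy_htod(self.d_input, np.ascontiguousarray(batch))
        return [int(self.d_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()         # reuse scales from a previous run
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```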
Convert the model (pb → ONNX) to a TensorRT INT8 engine:
time python ~/github/TensorRT/samples/python/tensorflow_object_detection_api/build_engine.py --onnx model.onnx --engine model_int8.engine --precision int8 --calib_input=val2017 --calib_cache model_int8.calib
Result:
[10/07/2022-17:41:57] [TRT] [I] [MemUsageChange] Init CUDA: CPU +356, GPU +0, now: CPU 388, GPU 12053 (MiB)
[10/07/2022-17:41:57] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 388 MiB, GPU 12053 MiB
[10/07/2022-17:41:58] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 493 MiB, GPU 12158 MiB
[10/07/2022-17:41:58] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[10/07/2022-17:41:58] [TRT] [W] ShapedWeights.cpp:173: Weights decoder/Overfeat/ip/read:0 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[10/07/2022-17:41:58] [TRT] [W] ShapedWeights.cpp:173: Weights decoder/conf_ip0/read:0 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[10/07/2022-17:41:58] [TRT] [W] ShapedWeights.cpp:173: Weights decoder/box_ip0/read:0 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
INFO:EngineBuilder:Network Description
...
DataType.FLOAT
INFO:EngineBuilder:Building int8 Engine in /home/jetson/data/model_int8.engine
[10/07/2022-17:41:58] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +227, GPU +232, now: CPU 748, GPU 12421 (MiB)
[10/07/2022-17:41:59] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +307, GPU +305, now: CPU 1055, GPU 12726 (MiB)
[10/07/2022-17:41:59] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[10/07/2022-17:42:02] [TRT] [I] Detected 1 inputs and 2 output network tensors.
[10/07/2022-17:42:03] [TRT] [I] Total Host Persistent Memory: 16512
[10/07/2022-17:42:03] [TRT] [I] Total Device Persistent Memory: 0
[10/07/2022-17:42:03] [TRT] [I] Total Scratch Memory: 24509440
[10/07/2022-17:42:03] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 271 MiB
[10/07/2022-17:42:03] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 46.2998ms to assign 6 blocks to 161 nodes requiring 316293120 bytes.
[10/07/2022-17:42:03] [TRT] [I] Total Activation Memory: 316293120
[10/07/2022-17:42:03] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1512, GPU 13493 (MiB)
[10/07/2022-17:42:03] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1512, GPU 13493 (MiB)
[10/07/2022-17:42:03] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +0, now: CPU 1512, GPU 13493 (MiB)
[10/07/2022-17:42:03] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1512, GPU 13493 (MiB)
[10/07/2022-17:42:03] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +301, now: CPU 0, GPU 333 (MiB)
[10/07/2022-17:42:03] [TRT] [I] Starting Calibration.
INFO:EngineBuilder:Calibrating image 8 / 5000
[10/07/2022-17:42:05] [TRT] [I] Calibrated batch 0 in 1.44894 seconds.
INFO:EngineBuilder:Calibrating image 16 / 5000
[10/07/2022-17:42:06] [TRT] [I] Calibrated batch 1 in 1.4619 seconds.
INFO:EngineBuilder:Calibrating image 24 / 5000
[10/07/2022-17:42:08] [TRT] [I] Calibrated batch 2 in 1.4459 seconds.
...
[10/07/2022-18:00:07] [TRT] [I] Calibrated batch 622 in 1.4218 seconds.
INFO:EngineBuilder:Calibrating image 4992 / 5000
[10/07/2022-18:00:09] [TRT] [I] Calibrated batch 623 in 1.37199 seconds.
INFO:EngineBuilder:Calibrating image 5000 / 5000
[10/07/2022-18:00:10] [TRT] [I] Calibrated batch 624 in 1.37575 seconds.
INFO:EngineBuilder:Finished calibration batches
[10/07/2022-18:00:48] [TRT] [I] Post Processing Calibration data in 37.6704 seconds.
[10/07/2022-18:00:48] [TRT] [I] Calibration completed in 1130.22 seconds.
[10/07/2022-18:00:48] [TRT] [I] Writing Calibration Cache for calibrator: TRT-8201-EntropyCalibration2
INFO:EngineBuilder:Writing calibration cache data to: model_int8.calib
[10/07/2022-18:00:48] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 1) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[10/07/2022-18:00:48] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 149) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[10/07/2022-18:00:48] [TRT] [W] Missing scale and zero-point for tensor decoder/Reshape_3:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[10/07/2022-18:00:48] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 161) [Shuffle]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[10/07/2022-18:00:48] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 162) [Softmax]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[10/07/2022-18:00:48] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 168) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[10/07/2022-18:00:48] [TRT] [I] ---------- Layers Running on DLA ----------
[10/07/2022-18:00:48] [TRT] [I] ---------- Layers Running on GPU ----------
...
[10/07/2022-18:00:48] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1582, GPU 15006 (MiB)
[10/07/2022-18:00:48] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +0, now: CPU 1583, GPU 15006 (MiB)
[10/07/2022-18:00:48] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/07/2022-18:05:22] [TRT] [I] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[10/07/2022-18:07:49] [TRT] [I] Detected 1 inputs and 2 output network tensors.
[10/07/2022-18:07:49] [TRT] [I] Total Host Persistent Memory: 86864
[10/07/2022-18:07:49] [TRT] [I] Total Device Persistent Memory: 14077952
[10/07/2022-18:07:49] [TRT] [I] Total Scratch Memory: 4055040
[10/07/2022-18:07:49] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 81 MiB, GPU 770 MiB
[10/07/2022-18:07:49] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 7.37309ms to assign 4 blocks to 71 nodes requiring 66654720 bytes.
[10/07/2022-18:07:49] [TRT] [I] Total Activation Memory: 66654720
[10/07/2022-18:07:49] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +7, now: CPU 1596, GPU 15758 (MiB)
[10/07/2022-18:07:49] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1597, GPU 15766 (MiB)
[10/07/2022-18:07:49] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +7, GPU +16, now: CPU 7, GPU 16 (MiB)
INFO:EngineBuilder:Serializing engine to file: /home/jetson/data/model_int8.engine
real 25m54.304s
user 12m59.576s
sys 2m19.060s
INT8 infer:
python trt_classificator_av.py --model=model_int8.engine --image=testimage.jpg
load: 2.4899113178253174 sec
1st frame: 0.08790969848632812 sec
infer: 0.08123373985290527 sec
[array([[0.01563849, 0.01595112, 0.0156964 , ..., 0.01456072, 0.01473866,
0.01655393],
[0.01571514, 0.01588619, 0.01548513, ..., 0.01446212, 0.01450986,
0.01636523],
[0.01548512, 0.01525605, 0.01514633, ..., 0.01449381, 0.01466152,
0.01619074],
...,
[0.01610926, 0.01537016, 0.0157151 , ..., 0.01567235, 0.01550561,
0.01676121],
[0.01601627, 0.0155768 , 0.01564961, ..., 0.01563967, 0.01556563,
0.01645875],
[0.01588311, 0.0155883 , 0.01569124, ..., 0.01570016, 0.01566373,
0.01626299]], dtype=float32), array([[ 0.12145996, -4.7773438 , 2.5253906 , -1.3378906 ],
[ 1.4853516 , -4.1601562 , 2.5488281 , -0.71191406],
[ 2.4570312 , -3.6328125 , 2.0429688 , 0.7548828 ],
...,
[-1.6835938 , -4.3554688 , 2.71875 , -0.69628906],
[-1.3671875 , -3.34375 , 2.125 , -0.05023193],
[-1.4833984 , -3.046875 , 1.4638672 , -0.29077148]],
dtype=float32)]
Summary on Jetson AGX Xavier 32GB, JetPack 4.6.1:
FP32:
load: 2.4099135398864746 sec
1st frame: 0.15322327613830566 sec
infer: 0.14696931838989258 sec
FP16:
load: 2.4407081604003906 sec
1st frame: 0.09991908073425293 sec
infer: 0.0901947021484375 sec
INT8:
load: 2.4899113178253174 sec
1st frame: 0.08790969848632812 sec
infer: 0.08123373985290527 sec
You can see that even with INT8, the speed has not improved much.
So I measured preprocessing and inference separately.
trt_classificator_av.py (11.8 KB)
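The split is plain wall-clock timing around the two stages. A rough sketch of the idea, reusing the infer() sketch from the FP16 section above; the resize target (300, 300) is a placeholder, and the actual preprocessing in the attached script differs (it includes googlenet_preprocess):

```python
import time

import cv2
import numpy as np

t0 = time.time()
# Preprocess: decode, resize to the network input, NHWC -> NCHW, scale.
image = cv2.imread("testimage.jpg")
image = cv2.resize(image, (300, 300)).astype(np.float32)  # placeholder size
image = image.transpose(2, 0, 1)[None] / 255.0
t1 = time.time()

outputs = infer(image)  # infer() from the sketch in the FP16 section
t2 = time.time()

print("preprocess:", t1 - t0, "sec")
print("infer:", t2 - t1, "sec")
```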
FP32:
-- image --
preprocess: 0.037741661071777344 sec
infer: 0.10097241401672363 sec
FP16:
-- image --
preprocess: 0.041993141174316406 sec
infer: 0.04722285270690918 sec
INT8:
-- image --
preprocess: 0.04466676712036133 sec
infer: 0.03199887275695801 sec
You can see that preprocessing is slow.
Comment out the preprocessing steps that are not needed and run again (in trt_classificator_av.py):
#image = googlenet_preprocess(image)
INT8:
-- image --
preprocess: 0.013820648193359375 sec
infer: 0.03211164474487305 sec
Total: 0.045932292938232425 sec
That is about 21.7 FPS (1 / 0.0459 s).
Addendum
I have been trying to find out why the batch_size dimension was missing.
This is probably the cause:
You can specify the input shape of your model in several different ways, for example by providing one of the following arguments to the first layer of your model (see the sketch after this list):
batch_input_shape: A tuple where the first dimension is the batch size.
input_shape: A tuple that does not include the batch size, e.g., the batch size is assumed to be None or batch_size, if specified.
input_dim: A scalar indicating the dimension of the input.
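A quick illustration of the three options with a Dense layer (the feature size 16 is arbitrary):

```python
import tensorflow as tf

# batch_input_shape: the first dimension is the batch size (pinned to 1 here)
m1 = tf.keras.Sequential([tf.keras.layers.Dense(8, batch_input_shape=(1, 16))])
print(m1.input_shape)   # (1, 16)

# input_shape: the batch size is left as None
m2 = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(16,))])
print(m2.input_shape)   # (None, 16)

# input_dim: scalar shorthand, equivalent to input_shape=(16,)
m3 = tf.keras.Sequential([tf.keras.layers.Dense(8, input_dim=16)])
print(m3.input_shape)   # (None, 16)
```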
Looking at the model, the input node x_in is [H,W,C], but an ExpandDims (Unsqueeze in ONNX) immediately expands it to [1,H,W,C].
Therefore, if I set the input node to ExpandDims:0 instead of x_in:0, the input layer has a batch dimension just like a normal model.
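One way to confirm this is to list the relevant nodes in the frozen graph (a sketch; the node names x_in and ExpandDims are the ones in this model):

```python
import tensorflow as tf

# Load the frozen graph and print the input placeholder and the
# ExpandDims node that adds the batch dimension.
graph_def = tf.compat.v1.GraphDef()
with open("frozen_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    if node.op in ("Placeholder", "ExpandDims"):
        print(node.op, node.name, list(node.input))
```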
To use DeepStream, add the --inputs-as-nchw option when converting to ONNX.
Convert frozen_graph.pb (input: NHWC) to ONNX (input: NCHW):
time python -m tf2onnx.convert --input frozen_graph.pb --output model.onnx --opset 12 --inputs ExpandDims:0 --outputs decoder/mul_1:0,decoder/Softmax:0 --inputs-as-nchw ExpandDims:0
Now that I have an ONNX model whose input node is [1,C,H,W], I can convert it to TensorRT.
time /usr/src/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model_fp32.engine
time /usr/src/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model_fp16.engine --fp16
time python ~/github/TensorRT/samples/python/tensorflow_object_detection_api/build_engine.py --onnx model.onnx --engine model_int8.engine --precision int8 --calib_input=val2017 --calib_cache model_int8.calib
Now that the dimensions of the input node are correct, I will also modify the inference code.
trt_classificator_av.py (11.3 KB)
Infer:
python trt_classificator_av.py --model=model_fp32.engine --image=testimage.jpg
python trt_classificator_av.py --model=model_fp16.engine --image=testimage.jpg
python trt_classificator_av.py --model=model_int8.engine --image=testimage.jpg
This engine can also be used for inference in DeepStream
(with network-type=100 and output-tensor-meta=1).
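A hypothetical excerpt of the matching nvinfer config: network-type=100 marks the model as a custom network so nvinfer skips its built-in post-processing, output-tensor-meta=1 attaches the raw output tensors to the metadata for a custom probe, and network-mode=1 selects INT8:

```
[property]
model-engine-file=model_int8.engine
# 0=FP32, 1=INT8, 2=FP16
network-mode=1
# 100 = custom network: nvinfer does no built-in post-processing
network-type=100
# attach raw output tensors to the frame metadata
output-tensor-meta=1
```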