I have been trying to use trt.create_inference_graph to convert my Keras-converted TensorFlow SavedModel from FP32 to FP16 and INT8, and then save it in a format that can be used for TensorFlow Serving. Code here - Google Colab
However, when I run this with my test client, I see no change in the timing.
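For context, the client measures latency roughly like this (a minimal sketch; `predict` is a hypothetical stand-in for the actual TF Serving gRPC `stub.Predict` call in my client, and the sleep just simulates a request):

```python
import time

def predict(image):
    # Hypothetical stand-in for the TF Serving call
    # (stub.Predict(request, timeout) in the real client).
    time.sleep(0.07)  # simulate ~70 ms per request

def time_requests(n, image=None):
    # Wall-clock time for n back-to-back requests, printed in the
    # same tuple format as the logs below.
    start = time.time()
    for _ in range(n):
        predict(image)
    elapsed = time.time() - start
    print(('Time for ', n, ' is ', elapsed))
    return elapsed

time_requests(10)
```

So each "Time for N" line below is the total wall-clock time for N sequential requests, not per-request latency.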
Here are the timings. What am I missing?
FP32 - V100 - no optimization
('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.968112)
('Label', 'person', ' at ', array([  0, 426, 512, 785]), ' Score ', 0.8355837)
('Label', 'person', ' at ', array([ 723,  475, 1067,  791]), ' Score ', 0.7234411)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 20, ' is ', 1.4128220081329346)
('Time for ', 10, ' is ', 0.7228488922119141)
FP32 with TensorFlow-based optimization - TransformGraph
Without weight or model quantization:
('Time for ', 10, ' is ', 0.6342859268188477)
FP?? with TensorFlow-based optimization + weight quantization - TransformGraph
After weight quantization, the model size is 39 MB (down from ~149 MB)!
But the time doubles:
('Time for ', 10, ' is ', 1.201113224029541)
Model Quantization - Does not work (at least with TF Serving)
Using NVIDIA TensorRT optimization (Colab notebook)
FP16 - V100
('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.9681119)
('Label', 'person', ' at ', array([  0, 426, 512, 785]), ' Score ', 0.83558357)
('Label', 'person', ' at ', array([ 723,  475, 1067,  791]), ' Score ', 0.7234408)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 10, ' is ', 0.8691568374633789)
('Time for ', 20, ' is ', 1.6196839809417725)
INT8
('Label', 'person', ' at ', array([409, 167, 728, 603]), ' Score ', 0.9681119)
('Label', 'person', ' at ', array([  0, 426, 512, 785]), ' Score ', 0.83558357)
('Label', 'person', ' at ', array([ 723,  475, 1067,  791]), ' Score ', 0.7234408)
('Label', 'tie', ' at ', array([527, 335, 569, 505]), ' Score ', 0.52543193)
('Time for ', 10, ' is ', 0.8551359176635742)