I have a model in Tensorflow with a fake_quant_with_min_max_args operation. I am running into problems converting the TF graph into a format that TensorRT understands.
I tried to follow the process described in https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#work-with-qat-networks
“Tensorflow quantized model with tensorflow::ops::FakeQuantWithMinMaxVars or tensorflow::ops::FakeQuantWithMinMaxVarsPerChannel nodes can be converted to sequence of QuantizeLinear and DequantizeLinear nodes (QDQ nodes).”
Question 1: where/how does this conversion happen?
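For context, here is my understanding of what that conversion means numerically (this is my own illustration, not tf2onnx internals): a FakeQuantWithMinMaxVars/Args node is mathematically a quantize followed by an immediate dequantize, which is exactly what an ONNX QuantizeLinear + DequantizeLinear pair computes when given the equivalent scale and zero point:

```python
import numpy as np

def fake_quant(x, min_val, max_val, num_bits=8):
    # TF-style fake quantization: quantize to num_bits, then dequantize.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = np.round(qmin - min_val / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def quantize_linear(x, scale, zero_point, qmin=0, qmax=255):
    # ONNX QuantizeLinear: y = clamp(round(x / scale) + zero_point)
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)

def dequantize_linear(q, scale, zero_point):
    # ONNX DequantizeLinear: y = (q - zero_point) * scale
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.5, 0.0, 0.7, 2.0], dtype=np.float32)
min_val, max_val = -1.0, 1.0
scale = (max_val - min_val) / 255.0
zero_point = round(-min_val / scale)  # 128 for this min/max

a = fake_quant(x, min_val, max_val)
b = dequantize_linear(quantize_linear(x, scale, zero_point), scale, zero_point)
assert np.allclose(a, b)  # the FakeQuant node and the QDQ pair agree
```

So the conversion is a local graph rewrite, but I still don't see where in the toolchain it is supposed to happen.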
“We use the tf2onnx converter to convert a quantized frozen model to a quantized ONNX model.”
I downloaded the latest snapshot of this code (https://github.com/onnx/tensorflow-onnx), and there is no support for converting a FakeQuantWithMinMaxVars node to QuantizeLinear. Neither symbol even appears anywhere in the source code. When I try to convert the graph, I get this error:
python ~/repos/tensorflow-onnx/tf2onnx/convert.py --input convert/out.pb --output convert/out.onnx --inputs "input_data:0" --outputs "…" --opset 11
2020-02-13 13:23:19,967 - INFO - Using tensorflow=1.14.0, onnx=1.6.0, tf2onnx=1.6.0/18ac0e
2020-02-13 13:23:19,967 - INFO - Using opset <onnx, 11>
2020-02-13 13:23:20,609 - ERROR - Tensorflow op [my_layer: FakeQuantWithMinMaxArgs] is not supported
2020-02-13 13:23:20,656 - ERROR - Failed to convert node seg_deconv1_1/BiasAdd
Question 2: does the TensorRT UFF parser contain any support for FakeQuantWithMinMaxVars?
Finally, I investigated whether it’s possible to bypass the issue by using an explicit precision network.
“Conversion of activation values between higher and lower precision is performed using scale layers. TensorRT identifies special quantizing and dequantizing scale layers for explicit precision networks. A quantizing scale layer has FP32 input, INT8 output, per channel or per tensor scales and no shift weights. A dequantizing scale layer has INT8 input, FP32 output, per tensor scales and no shift weights. No shift weights are allowed for quantizing and dequantizing scale layers as only symmetric quantization is supported. Such mixed-precision scale layers are only enabled for explicit precision networks.

For best performance, the special quantizing scale layers can be inserted immediately following Convolution and FullyConnected layers. In these cases, the scale layer is fused with the preceding layer.”
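If I read that correctly, the arithmetic of these special scale layers is symmetric per-tensor INT8 quantization with no zero point. A sketch of what I understand them to compute (my own illustration; the variable names and the calibrated range are made up, not TensorRT code):

```python
import numpy as np

def quantizing_scale(x_fp32, scale):
    # "Quantizing scale layer": FP32 in, INT8 out, no shift (symmetric only).
    q = np.round(x_fp32 / scale)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantizing_scale(q_int8, scale):
    # "Dequantizing scale layer": INT8 in, FP32 out, no shift.
    return q_int8.astype(np.float32) * scale

x = np.array([-3.0, -0.01, 0.0, 0.02, 2.5], dtype=np.float32)
amax = 3.0            # assumed calibrated dynamic range of this tensor
scale = amax / 127.0  # symmetric per-tensor scale, zero point fixed at 0
x_q = quantizing_scale(x, scale)
x_dq = dequantizing_scale(x_q, scale)
```

If that reading is right, the FP32 scale weight I need per layer is just `amax / 127` for each activation tensor.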
Question 3: what exactly do I need to do here?
If I understand correctly, I must load my model using the UFF or ONNX parser, explicitly set the precision of all layers to 8-bit, and insert a scale layer with a scalar FP32 weight after each convolution layer. Is the scale layer added before or after the ReLU activation? Does the ReLU run in 32-bit or 8-bit? Is there an example of how to do this?
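Regarding the ReLU placement, my working assumption (please correct me if this is wrong) is that it shouldn't matter numerically for symmetric quantization: with a positive scale and no shift, ReLU commutes with dequantization, i.e. relu(q · scale) == relu(q) · scale, so the runtime would be free to execute the ReLU in INT8 and fuse it. A quick numerical check of that identity:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
q = rng.integers(-128, 128, size=1000).astype(np.int8)  # simulated INT8 activations
scale = 0.05                                            # any positive per-tensor scale

# ReLU after dequantize (FP32 ReLU) vs. ReLU before dequantize (INT8 ReLU):
fp32_relu = relu(q.astype(np.float32) * scale)
int8_relu = relu(q).astype(np.float32) * scale
assert np.array_equal(fp32_relu, int8_relu)
```

But even if the math is placement-invariant, I'd still like to know which ordering the builder expects in order to trigger the fusion.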