I have a model in Tensorflow with a fake_quant_with_min_max_args operation. I am running into problems converting the TF graph into a format that TensorRT understands.
I tried to follow the process described in https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#work-with-qat-networks
“Tensorflow quantized model with tensorflow::ops::FakeQuantWithMinMaxVars or tensorflow::ops::FakeQuantWithMinMaxVarsPerChannel nodes can be converted to sequence of QuantizeLinear and DequantizeLinear nodes (QDQ nodes).”
Question 1: where/how does this conversion happen?
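For context, here is my understanding of what that conversion means numerically (this is my own illustration, not tf2onnx internals): a FakeQuantWithMinMaxVars/Args node is mathematically a quantize followed by an immediate dequantize, which is exactly what an ONNX QuantizeLinear + DequantizeLinear pair computes when given the equivalent scale and zero point:

```python
import numpy as np

def fake_quant(x, min_val, max_val, num_bits=8):
    # TF-style fake quantization: quantize to num_bits, then dequantize.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = np.round(qmin - min_val / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def quantize_linear(x, scale, zero_point, qmin=0, qmax=255):
    # ONNX QuantizeLinear: y = clamp(round(x / scale) + zero_point)
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)

def dequantize_linear(q, scale, zero_point):
    # ONNX DequantizeLinear: y = (q - zero_point) * scale
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.5, 0.0, 0.7, 2.0], dtype=np.float32)
min_val, max_val = -1.0, 1.0
scale = (max_val - min_val) / 255.0
zero_point = round(-min_val / scale)  # 128 for this min/max

a = fake_quant(x, min_val, max_val)
b = dequantize_linear(quantize_linear(x, scale, zero_point), scale, zero_point)
assert np.allclose(a, b)  # the FakeQuant node and the QDQ pair agree
```

So the conversion is a local graph rewrite, but I still don't see where in the toolchain it is supposed to happen.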
“We use the tf2onnx converter to convert a quantized frozen model to a quantized ONNX model.”
I downloaded the latest snapshot of this code (https://github.com/onnx/tensorflow-onnx), and there is no support for converting a FakeQuantWithMinMaxVars node to QuantizeLinear. Neither symbol even appears anywhere in the source code. When I try to convert the graph, I get this error:
python ~/repos/tensorflow-onnx/tf2onnx/convert.py --input convert/out.pb --output convert/out.onnx --inputs "input_data:0" --outputs "…" --opset 11
2020-02-13 13:23:19,967 - INFO - Using tensorflow=1.14.0, onnx=1.6.0, tf2onnx=1.6.0/18ac0e
2020-02-13 13:23:19,967 - INFO - Using opset <onnx, 11>
2020-02-13 13:23:20,609 - ERROR - Tensorflow op [my_layer: FakeQuantWithMinMaxArgs] is not supported
2020-02-13 13:23:20,656 - ERROR - Failed to convert node seg_deconv1_1/BiasAdd
Question 2: does the TensorRT UFF parser contain any support for FakeQuantWithMinMaxVars?
Finally, I investigated whether it’s possible to bypass the issue by using an explicit precision network.
“Conversion of activation values between higher and lower precision is performed using scale layers. TensorRT identifies special quantizing and dequantizing scale layers for explicit precision networks. A quantizing scale layer has FP32 input, INT8 output, per channel or per tensor scales and no shift weights. A dequantizing scale layer has INT8 input, FP32 output, per tensor scales and no shift weights. No shift weights are allowed for quantizing and dequantizing scale layers as only symmetric quantization is supported. Such mixed-precision scale layers are only enabled for explicit precision networks.

For best performance, the special quantizing scale layers can be inserted immediately following Convolution and FullyConnected layers. In these cases, the scale layer is fused with the preceding layer.”
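If I read that correctly, the arithmetic of these special scale layers is symmetric per-tensor INT8 quantization with no zero point. A sketch of what I understand them to compute (my own illustration; the variable names and the calibrated range are made up, not TensorRT code):

```python
import numpy as np

def quantizing_scale(x_fp32, scale):
    # "Quantizing scale layer": FP32 in, INT8 out, no shift (symmetric only).
    q = np.round(x_fp32 / scale)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantizing_scale(q_int8, scale):
    # "Dequantizing scale layer": INT8 in, FP32 out, no shift.
    return q_int8.astype(np.float32) * scale

x = np.array([-3.0, -0.01, 0.0, 0.02, 2.5], dtype=np.float32)
amax = 3.0            # assumed calibrated dynamic range of this tensor
scale = amax / 127.0  # symmetric per-tensor scale, zero point fixed at 0
x_q = quantizing_scale(x, scale)
x_dq = dequantizing_scale(x_q, scale)
```

If that reading is right, the FP32 scale weight I need per layer is just `amax / 127` for each activation tensor.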
Question 3: what exactly do I need to do here?
If I understand correctly, I must load my model using the UFF or ONNX parser, explicitly set the precision of all layers to 8-bit, and insert a scale layer with a scalar FP32 weight after each convolution layer. Is the scale layer added before or after the ReLU activation? Does the ReLU run in 32-bit or 8-bit? Is there an example of how to do this?
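Regarding the ReLU placement, my working assumption (please correct me if this is wrong) is that it shouldn't matter numerically for symmetric quantization: with a positive scale and no shift, ReLU commutes with dequantization, i.e. relu(q · scale) == relu(q) · scale, so the runtime would be free to execute the ReLU in INT8 and fuse it. A quick numerical check of that identity:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
q = rng.integers(-128, 128, size=1000).astype(np.int8)  # simulated INT8 activations
scale = 0.05                                            # any positive per-tensor scale

# ReLU after dequantize (FP32 ReLU) vs. ReLU before dequantize (INT8 ReLU):
fp32_relu = relu(q.astype(np.float32) * scale)
int8_relu = relu(q).astype(np.float32) * scale
assert np.array_equal(fp32_relu, int8_relu)
```

But even if the math is placement-invariant, I'd still like to know which ordering the builder expects in order to trigger the fusion.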