Output scale for cudnnConvolutionBiasActivationForward

Hello!

I'm having some trouble with the cudnnConvolutionBiasActivationForward function for int8 (x4, x32). The well-known practice for integer arithmetic in CNNs is to approximate floats as integers times a common scale factor; to be precise: float ≈ integer * scale. However, Figure 2 from the cuDNN Developer Guide indicates that after the convolution the results are cast to int8 based on the min/max values. Hence the output is a single tensor, and there is no way to cast it back to floats and obtain the 'real' results. That is a problem: I can't work with relative values, I need a proper mapping. Am I missing something? Is there any workaround?
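For context, this is the kind of mapping I mean. A minimal sketch of symmetric per-tensor quantization (the function names and the choice of scale are my own, not anything from cuDNN):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Symmetric per-tensor quantization: real_value ~ q * scale, with q in [-128, 127].
float choose_scale(float abs_max) {
    return abs_max / 127.0f;                      // scale derived from the tensor's max magnitude
}

int8_t quantize(float x, float scale) {
    float q = std::round(x / scale);              // round to nearest integer
    q = std::min(std::max(q, -128.0f), 127.0f);   // clamp into the int8 range
    return static_cast<int8_t>(q);
}

float dequantize(int8_t q, float scale) {
    return static_cast<float>(q) * scale;         // map back to (approximate) real values
}
```

The whole point is that I keep the scale on the side, so I can always recover real-valued results from the integer tensor.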

I'm especially interested in configurations that support the INT8x32 data type, mainly because of speed.

References:
Figure 2: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#scaling-parameters__fig-conv-bias-activation-forward
cudnnConvolutionBiasActivationForward: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnConvolutionBiasActivationForward

By the way, it looks like the INT8_EXT configuration (from cudnnConvolutionForward) could be a solution to my problem. But I'm not sure how it handles overflow, and I wasn't able to run cudnnConvolutionForward with inputs/filters in INT8x32 and an output data type of FLOAT.
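For reference, this is roughly what I'm trying for the INT8_EXT-style configuration: int8 input and filter, int32 accumulation, FLOAT output. This is only a sketch (error checking, algorithm/workspace selection, and shapes are placeholders, and I'm showing the INT8x4 variant since I haven't gotten the x32 path to work):

```cpp
#include <cudnn.h>

// Sketch of an "INT8_EXT"-style cudnnConvolutionForward call:
// int8 input/filter, FLOAT output, int32 compute type.
void int8_ext_conv(cudnnHandle_t handle,
                   const void* x, const void* w, void* y,
                   void* workspace, size_t workspaceBytes) {
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // Placeholder shapes; the INT8 paths require NHWC layout.
    const int n = 1, c = 32, h = 28, width = 28, k = 64, r = 3, s = 3;
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_INT8x4, n, c, h, width);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_INT8x4, CUDNN_TENSOR_NHWC, k, c, r, s);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_INT32);
    // Output descriptor in FLOAT, so the results can be rescaled back to real values.
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_FLOAT, n, k, h, width);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
                            workspace, workspaceBytes, &beta, yDesc, y);
}
```

Swapping the descriptors to CUDNN_DATA_INT8x32 with a FLOAT output is exactly the combination that fails for me.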

UPD: After some tests I found out that cudnnConvolutionForward with INT8x32 (input, output, and weights) does perform rounding followed by a saturating cast when converting from FLOAT to INT8x32. That is nice. I will continue experimenting with cudnnConvolutionBiasActivationForward.
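To be clear about what I mean by "saturating cast" (this is my assumption about the semantics based on the outputs I saw, not something stated explicitly in the docs): the float intermediate appears to be rounded to the nearest integer and then clamped into the int8 range rather than wrapping around, roughly like this:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Assumed behaviour when converting a FLOAT intermediate back to INT8:
// round to nearest, then saturate (clamp) to [-128, 127] instead of wrapping.
int8_t saturate_to_int8(float v) {
    float r = std::round(v);                         // round to nearest integer
    r = std::min(std::max(r, -128.0f), 127.0f);      // e.g. 300.0f -> 127, -412.0f -> -128
    return static_cast<int8_t>(r);
}
```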

How does it do the "saturation cast"? How does it determine the range? Could you provide more info?