TF/Keras inference 4 times faster with FP32 precision than with FP16

Hi, I know that there is no official datasheet (source), but I have found a preliminary datasheet: link. From there and from this_site
I conclude that the Jetson Nano has ~500 GFLOPS at FP16 precision and who knows how many at FP32, but I thought the Nano was FP16-oriented.
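As a rough sanity check on that number (my own back-of-the-envelope estimate, not from the datasheet): the Nano's GPU has 128 Maxwell CUDA cores at roughly 0.92 GHz, each capable of one FMA (2 FLOPs) per cycle, and this Maxwell variant runs packed FP16 at twice the FP32 rate, so 128 × 2 × 0.92 GHz ≈ 236 GFLOPS FP32 and 2 × 236 ≈ 472 GFLOPS FP16, which matches the ~500 GFLOPS figure.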

But I happened to compare the FP32 and FP16 inference times of the standard Keras MobileNet model, and FP16 inference is about 4x slower. Why is that?

My code:

import tensorflow as tf
import numpy as np
import tensorflow.contrib.keras as K


# Let the GPU allocator grow on demand instead of grabbing all memory up front
# (the Nano's CPU and GPU share the same physical memory)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.Session(config=config)


# FP32
K.backend.clear_session()
K.backend.set_learning_phase(0)   # inference mode
K.backend.set_floatx('float32')   # build the model with FP32 weights
with K.backend.get_session():
    mobilenet = tf.keras.applications.mobilenet.MobileNet(
        input_shape=None, alpha=1.0, depth_multiplier=1, dropout=1e-3,
        include_top=True, weights='imagenet', input_tensor=None,
        pooling=None, classes=1000)

    # Random uint8 "image"; the cast to the model dtype happens outside %timeit
    tmp = (np.random.standard_normal([1, 224, 224, 3]) * 255).astype(np.uint8)
    img = tmp.astype(np.float32)
    outs = mobilenet(img)   # warm-up call


%timeit -n1 -r1  outs = mobilenet(img)
>> 1.08 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -n1 -r2  outs = mobilenet(img)
>> 1.08 s ± 8.53 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

%timeit -n1 -r10  outs = mobilenet(img)
>> 1.07 s ± 5.48 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -n1 -r20  outs = mobilenet(img)
>> 1.09 s ± 22.3 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)


# FP16
K.backend.clear_session()
K.backend.set_learning_phase(0)   # inference mode
K.backend.set_floatx('float16')   # build the model with FP16 weights
with K.backend.get_session():
    mobilenet = tf.keras.applications.mobilenet.MobileNet(
        input_shape=None, alpha=1.0, depth_multiplier=1, dropout=1e-3,
        include_top=True, weights='imagenet', input_tensor=None,
        pooling=None, classes=1000)

    tmp = (np.random.standard_normal([1, 224, 224, 3]) * 255).astype(np.uint8)
    img = tmp.astype(np.float16)   # cast to FP16 outside %timeit
    outs = mobilenet(img)   # warm-up call


%timeit -n1 -r1  outs = mobilenet(img)
>> 4.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -n1 -r2  outs = mobilenet(img)
>> 4.68 s ± 31.4 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

%timeit -n1 -r10  outs = mobilenet(img)
>> 4.66 s ± 16.5 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

I used the TensorFlow build from https://devtalk.nvidia.com/default/topic/1048776/jetson-nano/official-tensorflow-for-jetson-nano-/

Hi,

This is related to TensorFlow’s implementation.
We will try to reproduce this and forward it to our internal team to see how we can help.

Thanks.

Hi,

We can reproduce this and have forwarded it to our internal team.
We will update this thread if we get any feedback.

Thanks.

My guess would be that the time is spent in data conversion.

My understanding is that the A57 in the Nano doesn’t have fp16 extensions: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473c/CJAFAIFF.html
Meanwhile, I think the Carmel cores in the Xavier do have such extensions, so code tuned on the Xavier might make a CPU/GPU trade-off that’s a bad match for the Nano.
I’d love to learn more about this area, though!
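One way to separate raw GPU kernel speed from framework overhead would be to time a single convolution in both precisions directly on the GPU (a minimal sketch, assuming a TF 1.x wheel like the one used above; the shapes are arbitrary):

import time
import numpy as np
import tensorflow as tf

def bench(dtype, runs=50):
    # Build one convolution pinned to the GPU in the requested precision
    tf.reset_default_graph()
    with tf.device('/gpu:0'):
        x = tf.constant(np.random.rand(1, 224, 224, 3).astype(dtype))
        w = tf.constant(np.random.rand(3, 3, 3, 32).astype(dtype))
        y = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME')
    with tf.Session() as sess:
        sess.run(y)   # warm-up: cuDNN autotuning and memory allocation
        start = time.time()
        for _ in range(runs):
            sess.run(y)
        return (time.time() - start) / runs

print('fp32:', bench(np.float32))
print('fp16:', bench(np.float16))

If FP16 wins here but the full model still loses, the slowdown is in TensorFlow's graph rather than in the hardware.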

As far as I know, and as you can see from my code above:

  1. I am running it on the GPU, not on the CPU (a device-placement check is sketched after my question below)
  2. I do the data conversion of the image outside of %timeit
  3. The GPU in the Nano does have FP16 support, and FP16 is faster than FP32 there
# This should run on the GPU
K.backend.set_floatx('float16')
# This conversion happens outside of the speed measurement
img = tmp.astype(np.float16)
# This part runs forward inference of the NN (with weights already converted
# to FP16) on the already-converted FP16 image, 10 times
%timeit -n1 -r10  outs = mobilenet(img)

Please explain where it would be spending time on data conversion.
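For what it's worth, one way to confirm point 1 is to log op placement before building the model (a minimal sketch, assuming TF 1.x; the session must be installed before the model is constructed):

import tensorflow as tf
import tensorflow.contrib.keras as K

# Log the device each op is assigned to, to confirm the FP16 graph is not
# silently falling back to CPU kernels
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True
K.backend.set_session(tf.Session(config=config))

# Rebuild the model and run one inference; the console then lists every op
# together with its assigned device (e.g. ...device:GPU:0)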

Hi,

This is related to the TensorFlow implementation.
We have passed it to the relevant team, and it will be prioritized internally.

In addition, we recommend trying TF-TRT.
With TF-TRT, we observe around a 2x speedup on MobileNet in FP16.
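For reference, a minimal TF-TRT conversion sketch (assuming the TF 1.x contrib API shipped in the JetPack wheel; frozen_graph and the output node name below are placeholders you would take from your own model):

import tensorflow.contrib.tensorrt as trt

# Replace supported subgraphs of a frozen GraphDef with TensorRT engines
# running in FP16; unsupported ops remain as native TensorFlow ops
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,       # frozen GraphDef of your model
    outputs=['Logits/Softmax'],         # hypothetical output node name
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16')
# Import trt_graph with tf.import_graph_def and run it in a session as usual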

Thanks.

Hi, any updates? It’s been a while.

Hi,

As mentioned in comment #6, this issue is related to the TensorFlow implementation rather than to TensorRT.
We see a 2x speedup on MobileNet + FP16 with TF-TRT. Is TF-TRT an option for you?

Please understand that it’s not easy for us to change the TensorFlow implementation, since it is a third-party library.
This issue is prioritized internally, but we also recommend contacting the TensorFlow team for direct support.

Thanks.