TF/Keras inference 4 times faster with FP32 precision than with FP16

Hi, I know that there is no official datasheet (source), but I have found a preliminary datasheet: link. From there and from this_site
I conclude that the Jetson Nano has ~500 GFLOPS at FP16 precision and who knows how many at FP32, but I thought the Nano was FP16-oriented.
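As a rough sanity check on that number (my own back-of-the-envelope estimate, not from the datasheet): the Nano's GPU has 128 Maxwell CUDA cores at roughly 0.92 GHz, each capable of one FMA (2 FLOPs) per cycle, and this Maxwell variant runs packed FP16 at twice the FP32 rate, so 128 × 2 × 0.92 GHz ≈ 236 GFLOPS FP32 and 2 × 236 ≈ 472 GFLOPS FP16, which matches the ~500 GFLOPS figure.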

But I happened to compare the FP32 and FP16 inference times of the standard Keras MobileNet model, and FP16 inference is about 4x slower. Why is that?

My code:

import tensorflow as tf
import numpy as np
import tensorflow.contrib.keras as K


# Let the GPU allocator grow on demand instead of grabbing all memory up front
# (the Nano's CPU and GPU share the same physical memory)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.Session(config=config)


# FP32
K.backend.clear_session()
K.backend.set_learning_phase(0)   # inference mode
K.backend.set_floatx('float32')   # build the model with FP32 weights
with K.backend.get_session():
    mobilenet = tf.keras.applications.mobilenet.MobileNet(
        input_shape=None, alpha=1.0, depth_multiplier=1, dropout=1e-3,
        include_top=True, weights='imagenet', input_tensor=None,
        pooling=None, classes=1000)

    # Random uint8 "image"; the cast to the model dtype happens outside %timeit
    tmp = (np.random.standard_normal([1, 224, 224, 3]) * 255).astype(np.uint8)
    img = tmp.astype(np.float32)
    outs = mobilenet(img)   # warm-up call


%timeit -n1 -r1  outs = mobilenet(img)
>> 1.08 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -n1 -r2  outs = mobilenet(img)
>> 1.08 s ± 8.53 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

%timeit -n1 -r10  outs = mobilenet(img)
>> 1.07 s ± 5.48 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -n1 -r20  outs = mobilenet(img)
>> 1.09 s ± 22.3 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)


# FP16
K.backend.clear_session()
K.backend.set_learning_phase(0)   # inference mode
K.backend.set_floatx('float16')   # build the model with FP16 weights
with K.backend.get_session():
    mobilenet = tf.keras.applications.mobilenet.MobileNet(
        input_shape=None, alpha=1.0, depth_multiplier=1, dropout=1e-3,
        include_top=True, weights='imagenet', input_tensor=None,
        pooling=None, classes=1000)

    tmp = (np.random.standard_normal([1, 224, 224, 3]) * 255).astype(np.uint8)
    img = tmp.astype(np.float16)   # cast to FP16 outside %timeit
    outs = mobilenet(img)   # warm-up call


%timeit -n1 -r1  outs = mobilenet(img)
>> 4.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -n1 -r2  outs = mobilenet(img)
>> 4.68 s ± 31.4 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

%timeit -n1 -r10  outs = mobilenet(img)
>> 4.66 s ± 16.5 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

I used the TensorFlow build from https://devtalk.nvidia.com/default/topic/1048776/jetson-nano/official-tensorflow-for-jetson-nano-/

Hi,

This is related to TensorFlow’s implementation.
We will try to reproduce this and forward it to our internal team to see how we can help.

Thanks.

Hi,

We can reproduce this and have forwarded it to our internal team.
We will update this thread if we get any feedback.

Thanks.

My guess would be that the time is spent in data conversion.

My understanding is that the A57 in the Nano doesn’t have fp16 extensions: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473c/CJAFAIFF.html
Meanwhile, I think the Carmel cores in the Xavier do have such extensions, so code tuned on the Xavier might make a CPU/GPU trade-off that’s a bad match for the Nano.
I’d love to learn more about this area, though!
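One way to separate raw GPU kernel speed from framework overhead would be to time a single convolution in both precisions directly on the GPU (a minimal sketch, assuming a TF 1.x wheel like the one used above; the shapes are arbitrary):

import time
import numpy as np
import tensorflow as tf

def bench(dtype, runs=50):
    # Build one convolution pinned to the GPU in the requested precision
    tf.reset_default_graph()
    with tf.device('/gpu:0'):
        x = tf.constant(np.random.rand(1, 224, 224, 3).astype(dtype))
        w = tf.constant(np.random.rand(3, 3, 3, 32).astype(dtype))
        y = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME')
    with tf.Session() as sess:
        sess.run(y)   # warm-up: cuDNN autotuning and memory allocation
        start = time.time()
        for _ in range(runs):
            sess.run(y)
        return (time.time() - start) / runs

print('fp32:', bench(np.float32))
print('fp16:', bench(np.float16))

If FP16 wins here but the full model still loses, the slowdown is in TensorFlow's graph rather than in the hardware.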

As far as I know, and as you can see from my code above:

  1. I am running it on the GPU, not on the CPU (a device-placement check is sketched after my question below)
  2. I do the data conversion of the image outside of %timeit
  3. The GPU in the Nano does have FP16 support, and FP16 is faster than FP32 there
# This should run on the GPU
K.backend.set_floatx('float16')
# This conversion happens outside of the speed measurement
img = tmp.astype(np.float16)
# This part runs forward inference of the NN (with weights already converted
# to FP16) on the already-converted FP16 image, 10 times
%timeit -n1 -r10  outs = mobilenet(img)

Please explain where it would be spending time on data conversion.
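For what it's worth, one way to confirm point 1 is to log op placement before building the model (a minimal sketch, assuming TF 1.x; the session must be installed before the model is constructed):

import tensorflow as tf
import tensorflow.contrib.keras as K

# Log the device each op is assigned to, to confirm the FP16 graph is not
# silently falling back to CPU kernels
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True
K.backend.set_session(tf.Session(config=config))

# Rebuild the model and run one inference; the console then lists every op
# together with its assigned device (e.g. ...device:GPU:0)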

Hi,

This is related to the TensorFlow implementation.
We have passed it to the relevant team, and it will be prioritized internally.

In addition, we recommend trying TF-TRT.
With TF-TRT, we observe around a 2x speedup on MobileNet in FP16.
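For reference, a minimal TF-TRT conversion sketch (assuming the TF 1.x contrib API shipped in the JetPack wheel; frozen_graph and the output node name below are placeholders you would take from your own model):

import tensorflow.contrib.tensorrt as trt

# Replace supported subgraphs of a frozen GraphDef with TensorRT engines
# running in FP16; unsupported ops remain as native TensorFlow ops
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,       # frozen GraphDef of your model
    outputs=['Logits/Softmax'],         # hypothetical output node name
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16')
# Import trt_graph with tf.import_graph_def and run it in a session as usual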

Thanks.

Hi, any updates? It’s been a while.

Hi,

As mentioned in comment #6, this issue is related to the TensorFlow implementation rather than to TensorRT.
We see a 2x speedup on MobileNet + FP16 with TF-TRT. Is TF-TRT an option for you?

Please understand that it’s not easy for us to change the TensorFlow implementation, since it is a third-party library.
This issue is prioritized internally, but we also recommend contacting the TensorFlow team for direct support.

Thanks.