Hi, I know that there is no official datasheet (source) yet, but I have found a preliminary datasheet: link. From it and from this_site
I conclude that the Jetson Nano delivers ~500 GFLOPS at FP16 precision and who knows how many at FP32, although I thought the Nano was FP16-oriented.
However, when I compared the FP32 and FP16 inference times of the standard Keras MobileNet model, FP16 inference turned out to be about 4x slower. Why is that?
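For what it's worth, the ~500 GFLOPS FP16 figure seems roughly consistent with a back-of-envelope calculation, if I have read the specs right (the core count and clock below are my assumptions from the preliminary datasheet, not confirmed numbers):

```python
# Back-of-envelope peak throughput (my assumptions: 128 Maxwell CUDA
# cores, ~0.9216 GHz max GPU clock, 1 FMA = 2 FLOPs per core per cycle,
# and a 2x FP16 rate via paired fp16x2 instructions).
cores = 128
clock_ghz = 0.9216
flops_per_fma = 2
fp32_gflops = cores * clock_ghz * flops_per_fma  # ~236 GFLOPS peak FP32
fp16_gflops = fp32_gflops * 2                    # ~472 GFLOPS peak FP16
print(round(fp32_gflops, 1), round(fp16_gflops, 1))
```

So if those assumptions hold, FP16 should be up to 2x faster than FP32 at peak, which makes the 4x slowdown I measured below even stranger.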
My code:
import tensorflow as tf
import numpy as np
import tensorflow.contrib.keras as K
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.Session(config=config)
# FP32
K.backend.clear_session()
K.backend.set_learning_phase(0)
K.backend.set_floatx('float32')
with K.backend.get_session():
    mobilenet = tf.keras.applications.mobilenet.MobileNet(
        input_shape=None, alpha=1.0, depth_multiplier=1, dropout=1e-3,
        include_top=True, weights='imagenet', input_tensor=None,
        pooling=None, classes=1000)
    tmp = (np.random.standard_normal([1, 224, 224, 3]) * 255).astype(np.uint8)
    img = tmp.astype(np.float32)
    outs = mobilenet(img)

%timeit -n1 -r1 outs = mobilenet(img)
>> 1.08 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -n1 -r2 outs = mobilenet(img)
>> 1.08 s ± 8.53 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
%timeit -n1 -r10 outs = mobilenet(img)
>> 1.07 s ± 5.48 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
%timeit -n1 -r20 outs = mobilenet(img)
>> 1.09 s ± 22.3 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)
# FP16
K.backend.clear_session()
K.backend.set_learning_phase(0)
K.backend.set_floatx('float16')
with K.backend.get_session():
    mobilenet = tf.keras.applications.mobilenet.MobileNet(
        input_shape=None, alpha=1.0, depth_multiplier=1, dropout=1e-3,
        include_top=True, weights='imagenet', input_tensor=None,
        pooling=None, classes=1000)
    tmp = (np.random.standard_normal([1, 224, 224, 3]) * 255).astype(np.uint8)
    img = tmp.astype(np.float16)
    outs = mobilenet(img)

%timeit -n1 -r1 outs = mobilenet(img)
>> 4.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -n1 -r2 outs = mobilenet(img)
>> 4.68 s ± 31.4 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
%timeit -n1 -r10 outs = mobilenet(img)
>> 4.66 s ± 16.5 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
I used the TensorFlow build from https://devtalk.nvidia.com/default/topic/1048776/jetson-nano/official-tensorflow-for-jetson-nano-/
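In case it helps anyone reproduce this outside IPython, here is a minimal stdlib-only timing helper I would use in place of %timeit (the name `bench` and the warm-up count are my own choices, not from any library):

```python
import timeit
import statistics

def bench(fn, *args, warmup=3, repeats=10):
    """Report (mean, stdev) per-call wall time of fn(*args) in seconds.

    A few warm-up calls are made first so that one-off setup cost
    (graph building, cuDNN autotuning) does not pollute the numbers.
    """
    for _ in range(warmup):
        fn(*args)
    times = timeit.repeat(lambda: fn(*args), number=1, repeat=repeats)
    return statistics.mean(times), statistics.stdev(times)

# e.g. with the models above (after the session is set up):
#   mean_s, std_s = bench(mobilenet.predict, img)
```

One caveat I am not fully sure about: in TF 1.x graph mode, calling `mobilenet(img)` on a NumPy array may build new graph ops and return a symbolic tensor rather than actually running inference, so timing `mobilenet.predict(img)` might be the fairer measurement for both precisions.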