Keras MobileNets .h5 model inference on Jetson Nano: GPU is 10x slower than CPU

I trained a MobileNetV1 model with Keras, which produced a .h5 file. I then copied the .h5 model to a Jetson Nano board and wrote a small Keras prediction demo that runs on the Nano. When I tried to use the GPU to speed up inference, however, it shocked me: it is much slower than CPU only. On GPU: 10 s; on CPU: 1 s. I can't explain this, since MobileNet is a very small and simple model. Could someone do me a favor? Thanks!

<b><i>MobileNets model summary is as follows:</i></b>
Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_6 (InputLayer)         (None, 31, 56, 1)         0
_________________________________________________________________
initial_conv (Conv2D)        (None, 16, 28, 19)        171
_________________________________________________________________
initial_bn (BatchNormalizati (None, 16, 28, 19)        76
_________________________________________________________________
initial_act (Activation)     (None, 16, 28, 19)        0
_________________________________________________________________
block0_dw (DepthwiseConv2D)  (None, 16, 28, 19)        171
_________________________________________________________________
block0_bn1 (BatchNormalizati (None, 16, 28, 19)        76
_________________________________________________________________
block0_act1 (Activation)     (None, 16, 28, 19)        0
_________________________________________________________________
block0_pw (Conv2D)           (None, 16, 28, 38)        722
_________________________________________________________________
block0_bn2 (BatchNormalizati (None, 16, 28, 38)        152
_________________________________________________________________
block0_act2 (Activation)     (None, 16, 28, 38)        0
_________________________________________________________________
block1_dw (DepthwiseConv2D)  (None, 8, 14, 38)         342
_________________________________________________________________
block1_bn1 (BatchNormalizati (None, 8, 14, 38)         152
_________________________________________________________________
block1_act1 (Activation)     (None, 8, 14, 38)         0
_________________________________________________________________
block1_pw (Conv2D)           (None, 8, 14, 76)         2888
_________________________________________________________________
block1_bn2 (BatchNormalizati (None, 8, 14, 76)         304
_________________________________________________________________
block1_act2 (Activation)     (None, 8, 14, 76)         0
_________________________________________________________________
block2_dw (DepthwiseConv2D)  (None, 8, 14, 76)         684
_________________________________________________________________
block2_bn1 (BatchNormalizati (None, 8, 14, 76)         304
_________________________________________________________________
block2_act1 (Activation)     (None, 8, 14, 76)         0
_________________________________________________________________
block2_pw (Conv2D)           (None, 8, 14, 76)         5776
_________________________________________________________________
block2_bn2 (BatchNormalizati (None, 8, 14, 76)         304
_________________________________________________________________
block2_act2 (Activation)     (None, 8, 14, 76)         0
_________________________________________________________________
block3_dw (DepthwiseConv2D)  (None, 4, 7, 76)          684
_________________________________________________________________
block3_bn1 (BatchNormalizati (None, 4, 7, 76)          304
_________________________________________________________________
block3_act1 (Activation)     (None, 4, 7, 76)          0
_________________________________________________________________
block3_pw (Conv2D)           (None, 4, 7, 153)         11628
_________________________________________________________________
block3_bn2 (BatchNormalizati (None, 4, 7, 153)         612
_________________________________________________________________
block3_act2 (Activation)     (None, 4, 7, 153)         0
_________________________________________________________________
block4_dw (DepthwiseConv2D)  (None, 4, 7, 153)         1377
_________________________________________________________________
block4_bn1 (BatchNormalizati (None, 4, 7, 153)         612
_________________________________________________________________
block4_act1 (Activation)     (None, 4, 7, 153)         0
_________________________________________________________________
block4_pw (Conv2D)           (None, 4, 7, 153)         23409
_________________________________________________________________
block4_bn2 (BatchNormalizati (None, 4, 7, 153)         612
_________________________________________________________________
block4_act2 (Activation)     (None, 4, 7, 153)         0
_________________________________________________________________
block5_dw (DepthwiseConv2D)  (None, 2, 4, 153)         1377
_________________________________________________________________
block5_bn1 (BatchNormalizati (None, 2, 4, 153)         612
_________________________________________________________________
block5_act1 (Activation)     (None, 2, 4, 153)         0
_________________________________________________________________
block5_pw (Conv2D)           (None, 2, 4, 307)         46971
_________________________________________________________________
block5_bn2 (BatchNormalizati (None, 2, 4, 307)         1228
_________________________________________________________________
block5_act2 (Activation)     (None, 2, 4, 307)         0
_________________________________________________________________
block6_dw (DepthwiseConv2D)  (None, 2, 4, 307)         2763
_________________________________________________________________
block6_bn1 (BatchNormalizati (None, 2, 4, 307)         1228
_________________________________________________________________
block6_act1 (Activation)     (None, 2, 4, 307)         0
_________________________________________________________________
block6_pw (Conv2D)           (None, 2, 4, 307)         94249
_________________________________________________________________
block6_bn2 (BatchNormalizati (None, 2, 4, 307)         1228
_________________________________________________________________
block6_act2 (Activation)     (None, 2, 4, 307)         0
_________________________________________________________________
block7_dw (DepthwiseConv2D)  (None, 2, 4, 307)         2763
_________________________________________________________________
block7_bn1 (BatchNormalizati (None, 2, 4, 307)         1228
_________________________________________________________________
block7_act1 (Activation)     (None, 2, 4, 307)         0
_________________________________________________________________
block7_pw (Conv2D)           (None, 2, 4, 307)         94249
_________________________________________________________________
block7_bn2 (BatchNormalizati (None, 2, 4, 307)         1228
_________________________________________________________________
block7_act2 (Activation)     (None, 2, 4, 307)         0
_________________________________________________________________
block8_dw (DepthwiseConv2D)  (None, 1, 2, 307)         2763
_________________________________________________________________
block8_bn1 (BatchNormalizati (None, 1, 2, 307)         1228
_________________________________________________________________
block8_act1 (Activation)     (None, 1, 2, 307)         0
_________________________________________________________________
block8_pw (Conv2D)           (None, 1, 2, 614)         188498
_________________________________________________________________
block8_bn2 (BatchNormalizati (None, 1, 2, 614)         2456
_________________________________________________________________
block8_act2 (Activation)     (None, 1, 2, 614)         0
_________________________________________________________________
block9_dw (DepthwiseConv2D)  (None, 1, 1, 614)         5526
_________________________________________________________________
block9_bn1 (BatchNormalizati (None, 1, 1, 614)         2456
_________________________________________________________________
block9_act1 (Activation)     (None, 1, 1, 614)         0
_________________________________________________________________
block9_pw (Conv2D)           (None, 1, 1, 614)         376996
_________________________________________________________________
block9_bn2 (BatchNormalizati (None, 1, 1, 614)         2456
_________________________________________________________________
block9_act2 (Activation)     (None, 1, 1, 614)         0
_________________________________________________________________
global_avg (GlobalAveragePoo (None, 614)               0
_________________________________________________________________
softmax (Dense)              (None, 4)                 2460
=================================================================
Total params: 885,323
Trainable params: 875,895
Non-trainable params: 9,428
_________________________________________________________________
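The layer sizes in the summary are consistent with bias-free 3x3 depthwise and 1x1 pointwise convolutions, with BatchNorm holding four values per channel. A quick sanity check of the block0 counts (layer names taken from the summary above):

```python
# Parameter counts for one depthwise-separable block with no conv biases,
# matching the model summary above.
def depthwise_params(in_ch, k=3):
    return k * k * in_ch            # one k x k filter per input channel

def pointwise_params(in_ch, out_ch):
    return in_ch * out_ch           # 1x1 conv mixing channels

def bn_params(ch):
    return 4 * ch                   # gamma, beta, moving mean, moving variance

# block0 of the summary: 19 -> 38 channels
assert depthwise_params(19) == 171          # block0_dw
assert bn_params(19) == 76                  # block0_bn1
assert pointwise_params(19, 38) == 722      # block0_pw
assert bn_params(38) == 152                 # block0_bn2
print("block0 parameter counts match the summary")
```

At under 900k parameters the model is indeed tiny, which suggests the measured times are dominated by per-call overhead rather than by arithmetic.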

<b><i>The inference code (CPU-only) is below.</i></b>
from PIL import Image
import keras
from keras import models
#from keras.utils.generic_utils import CustomObjectScope
import numpy as np
import time
import os
import tensorflow as tf
import keras.backend.tensorflow_backend as KTF

os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # no device with index 1 exists on the Nano, so this hides the GPU and forces CPU
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# config.gpu_options.per_process_gpu_memory_fraction = 0.8
sess = tf.Session(config=config)
KTF.set_session(sess)

width = 56
height = 31
imagesArrayList = []
#with CustomObjectScope({'relu6': keras.applications.mobilenet.relu6,'DepthwiseConv2D': keras.applications.mobilenet.DepthwiseConv2D}):
print('Loading model......')
model = models.load_model('my_model2_0.6_31_56.h5')
model.summary()
print('Load successful!')
start = time.time()
image = Image.open('0.jpg')
image = image.resize((width, height))
imageArray = np.array(image)
imageArray = np.reshape(imageArray, (height, width, 1))
imageArray = imageArray.astype('float32')/255
imagesArrayList.append(imageArray)
imageData = np.array(imagesArrayList)
result = model.predict(imageData)
end = time.time()
label = np.argmax(result)
print(str(end-start)+'s')
print('label:', label)
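The preprocessing in the script (resize to 56x31, single grayscale channel, scale to [0, 1], batch of one) can be shape-checked with NumPy alone, using a synthetic image in place of `0.jpg`:

```python
import numpy as np

width, height = 56, 31  # same target size as the script

# synthetic grayscale image standing in for Image.open('0.jpg') after resize
image = np.random.randint(0, 256, size=(height, width), dtype=np.uint8)

# same steps as the script: add a channel axis, scale to [0, 1], batch of one
arr = image.reshape(height, width, 1).astype('float32') / 255
batch = np.expand_dims(arr, axis=0)

assert batch.shape == (1, 31, 56, 1)   # matches input_6 (None, 31, 56, 1)
assert 0.0 <= batch.min() and batch.max() <= 1.0
```

Note that `np.reshape` only succeeds here if the image really is single-channel; an RGB `0.jpg` would need a grayscale conversion (e.g. `image.convert('L')`) first.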

<b><i>The CPU inference takes 1.417342185974121 s; the output is below:</i></b>

Load successful!
1.417342185974121s
label: 1
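One caveat about these numbers: the script times a single `predict` call, and the very first call also pays one-off setup cost (graph initialization on CPU; cuDNN autotuning and memory allocation on GPU, as the bfc_allocator warning in the log below hints). A fairer comparison warms the model up first and averages over repeated calls. A minimal stdlib-only sketch, with a placeholder `predict` standing in for `model.predict`:

```python
import time

def predict(batch):
    # placeholder for model.predict(batch); any callable works here
    return [sum(batch)]

def benchmark(fn, batch, warmup=3, runs=20):
    for _ in range(warmup):          # discard one-off setup cost
        fn(batch)
    start = time.perf_counter()
    for _ in range(runs):
        fn(batch)
    return (time.perf_counter() - start) / runs  # mean seconds per call

mean_s = benchmark(predict, [0.0] * 10)
print('mean inference time: %.6fs' % mean_s)
```

Measured this way, much of the 10 s GPU figure would likely turn out to be first-call setup rather than steady-state inference time.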

<i><b>In contrast, for GPU inference os.environ["CUDA_VISIBLE_DEVICES"] is changed to '0'.</b></i>
<b>Some of the TensorFlow log is as follows:</b>
2019-11-02 22:52:44.860485: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-11-02 22:52:44.863771: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-11-02 22:52:44.866540: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-11-02 22:52:44.867401: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-11-02 22:52:44.871384: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-11-02 22:52:44.878644: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-11-02 22:52:44.888082: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-11-02 22:52:44.888394: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-11-02 22:52:44.888803: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-11-02 22:52:44.888921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-11-02 22:52:44.889047: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-11-02 22:52:46.082811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-02 22:52:46.082895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-11-02 22:52:46.082932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-11-02 22:52:46.083331: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-11-02 22:52:46.083645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2019-11-02 22:52:46.083833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1600 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
Loading model......
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
Load successful!
2019-11-02 22:53:12.755127: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-11-02 22:53:13.395678: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-11-02 22:53:21.320321: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 831.15MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
10.121127605438232s
label: 1

<i><b>The GPU inference takes 10.121127605438232 s; the output is above.</b></i>

Hi,

Actually, this depends on the implementation of the framework you used.
A third-party library may not optimize its implementation for our embedded platform.

For Jetson, it's recommended to convert the model into our own TensorRT engine for fast inference.
Here are the benchmark results for the Jetson Nano.

We can get 39 FPS with a MobileNet-V2 model at 300x300 input size.

Thanks.

@AastaLLL Thanks! I converted the model to a UFF file for use with the TensorRT engine, and it achieves the amazing speed you mentioned above.