Running with GPU is slower than CPU on TX2

I copied the code from GitHub and ran the sample on both CPU and GPU. The results really surprised me:
with the CPU it detected in 1.2 s, but with the GPU it took 7 s.
The code is from yolo_tensorflow.

My environment:
Jetson TX2
Tensorflow 1.3
Cuda compilation tools, release 8.0, V8.0.72
python 3.5.2

The output with GPU:
nvidia@tegra-ubuntu:/media/nvidia/YYFSD/DL/YOLO$ python
2018-04-27 15:43:36.106786: I tensorflow/stream_executor/cuda/] ARM64 does not support NUMA - returning NUMA node zero
2018-04-27 15:43:36.106972: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties:
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 3.76GiB
2018-04-27 15:43:36.107041: I tensorflow/core/common_runtime/gpu/] DMA: 0
2018-04-27 15:43:36.107071: I tensorflow/core/common_runtime/gpu/] 0: Y
2018-04-27 15:43:36.107138: I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)
Restoring weights from: data/weights/YOLO_small.ckpt
Average detecting time: 7.569s

The output with CPU:
Restoring weights from: data/weights/YOLO_small.ckpt
Average detecting time: 1.196s

I would really appreciate it if someone can help me out.


It’s recommended to try TensorFlow 1.7, since it supports TensorRT.
You can download the pre-built wheel here:
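Whichever TensorFlow version you use, it is also worth checking whether the reported GPU average includes the first inference, since the first GPU run pays one-time costs (CUDA context creation, cuDNN autotuning) that can dominate a short benchmark on the TX2. Below is a minimal, generic timing sketch; the `detect` callable and `images` list are hypothetical placeholders standing in for the model's inference call and the test inputs:

```python
import time

def average_detect_time(detect, images, warmup=2):
    """Average per-image inference time, excluding warm-up runs.

    detect : hypothetical callable that runs one inference on an image.
    images : list of inputs; the first `warmup` of them are run but
             not timed, so one-time GPU setup cost is excluded.
    """
    # Warm-up runs: executed but not included in the timing.
    for img in images[:warmup]:
        detect(img)

    timed = images[warmup:]
    start = time.perf_counter()
    for img in timed:
        detect(img)
    elapsed = time.perf_counter() - start

    return elapsed / len(timed)
```

If the per-image time drops sharply once warm-up runs are excluded, the slowdown is initialization overhead rather than the GPU itself being slow.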