Object Detection Performance Jetson Tx2 slower than expected

gustavvz · December 20, 2017, 9:26am

Hey Developers,

I am currently running several object detection APIs on the Jetson TX2 to figure out which ones are capable of real-time detection.

Two examples are Google’s API with TensorFlow (https://github.com/tensorflow/models/tree/master/research/object_detection), which I changed a little to run as a Python script with the onboard camera or a webcam as input,
and YOLO on Darknet (YOLO: Real-Time Object Detection).

I speed up the Jetson with:

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh

and my results are:
TensorFlow with SSD_Mobilenet: 4 FPS
Darknet with Tiny-Yolo: 17.5 FPS
Darknet with Yolo-v2: 2.7 FPS

tegrastats gives me:

RAM 4393/7851MB (lfb 356x4MB) CPU [43%@2035,25%@2035,15%@2035,38%@2035,40%@2035,40%@2035] BCPU@35C MCPU@35C GPU@41C PLL@35C AO@35.5C Tboard@28C Tdiode@34.5C PMIC@100C thermal@34.7C VDD_IN 12282/12517 VDD_CPU 2059/2064 VDD_GPU 4727/4803 VDD_SOC 1601/1595 VDD_WIFI 0/69 VDD_DDR 2812/2808

Tensorflow gives me:

name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 2.00GiB
2017-12-20 10:16:28.963403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)

The “freeMemory” value varies up to 4 GiB but is never more than that. What does that value mean? Why is it so low? How can I free more memory and assign it to the object detection task?

Those frame rates are not terrible, but they are far from fast. So how is it possible that the Jetson is used in autonomous cars? I was expecting much more speed. My Dell laptop with an NVIDIA GTX 1050 is twice as fast in these test scenarios.

So am I doing something wrong? How can I increase performance in terms of FPS?

Thank you in advance!

AastaLLL · December 21, 2017, 6:50am

Hi,

Do you run TensorFlow with config.gpu_options.per_process_gpu_memory_fraction = xx?
This configuration will limit the allocated amount of GPU memory. You can get more information here:

Here are two suggested object detection samples:
1. DetectNet with Jetson_inference:
https://github.com/dusty-nv/jetson-inference#locating-object-coordinates-using-detectnet
2. Backend sample in Tegra Multimedia API

Thanks.

D_pz · January 3, 2018, 7:18am

I have the same situation: running TensorFlow inference with the ssd_mobilenet_v1 model provided by Google, I only get 4 FPS on video. Does anyone have an idea how to improve the inference speed?

gustavvz · January 3, 2018, 8:02am

@D_pz I am currently working on the Jetson TX2 with Google’s Object Detection API.
I created a github repo to work with it.
Should work for you too. Would be nice if you try it out or contribute!

gustavvz · January 3, 2018, 4:49pm

@AastaLLL
No, I don’t run TensorFlow with this config. Where should it be included?

I ran the TensorFlow Object Detection API and got the following output from sudo ./tegrastats:

RAM 7565/7851MB (lfb 5x4MB) CPU [46%@2025,20%@2035,12%@2034,44%@2029,45%@2031,45%@2028] EMC_FREQ 5%@1866 GR3D_FREQ 6%@1300 APE 150 MTS fg 0% bg 0% BCPU@34.5C MCPU@34.5C GPU@40.5C PLL@34.5C AO@32C Tboard@29C Tdiode@32.25C PMIC@100C thermal@33.7C VDD_IN 6342/4735 VDD_CPU 2063/1405 VDD_GPU 1069/368 VDD_SOC 992/934 VDD_WIFI 19/42 VDD_DDR 1514/1316

It seems that all of the RAM is used, which is good. But CPU usage is only between about 10% and 50%,

and the biggest problem is that GPU usage is only 6%.

Do you know how I can increase the GPU usage?
I think this is why I only get around 5 FPS when detecting objects with SSD Mobilenet.
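
For anyone who wants to log these numbers while a detector is running, here is a minimal sketch that pulls RAM, per-core CPU load and GPU load (the GR3D_FREQ field) out of one tegrastats line. The function name and dictionary keys are made up for illustration, and the regular expressions assume the output format quoted in this thread; other L4T releases may print the fields differently.

```python
import re

# Sample copied from the tegrastats output quoted in this post
# (temperature and power fields left out for brevity).
SAMPLE = ("RAM 7565/7851MB (lfb 5x4MB) CPU [46%@2025,20%@2035,12%@2034,"
          "44%@2029,45%@2031,45%@2028] EMC_FREQ 5%@1866 GR3D_FREQ 6%@1300")


def parse_tegrastats(line):
    """Extract RAM use, per-core CPU load and GPU load from one tegrastats line."""
    stats = {}
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    if ram:
        stats["ram_used_mb"] = int(ram.group(1))
        stats["ram_total_mb"] = int(ram.group(2))
    cpu = re.search(r"CPU \[([^\]]*)\]", line)
    if cpu:
        # Each entry looks like "46%@2025": load in percent, then the core clock.
        stats["cpu_load_pct"] = [int(p) for p in re.findall(r"(\d+)%@\d+", cpu.group(1))]
    # GR3D_FREQ can be missing, as in the first output quoted in this thread.
    gpu = re.search(r"GR3D_FREQ (\d+)%", line)
    if gpu:
        stats["gpu_load_pct"] = int(gpu.group(1))
    return stats


print(parse_tegrastats(SAMPLE))
```

Feeding it every line that tegrastats prints during a run makes it easier to see whether the GPU load stays in the single digits for the whole run or only between frames.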

S4WRXTTCS · January 4, 2018, 12:28am

The problem with detectnet is it’s not really intended for multiple objects. Sure you can do 2 or maybe 3, but I haven’t seen anything past that.

If a person needs to pick an object detection network to detect multiple objects with a Jetson TX2, what do they pick?

Assuming they want something reasonably fast (approx 15fps) with a reasonable resolution (640x480)?

I don’t see anything within the NVIDIA DIGITS → NVIDIA TX2 workflow that’s really meant for it.

On the list of things to try there is SSD or Faster R-CNN, but to my knowledge neither has been shown to run faster than 5 FPS on the TX2.

There is YOLO, but my understanding is that it gives up some accuracy.

AastaLLL · January 5, 2018, 7:15am

Hi all,

Here are some suggestions:

1. We recommend that TensorFlow users use TensorRT to fully utilize the hardware resources:

2. We have a tutorial for multi-class detection with DetectNet:
GitHub - dusty-nv/jetson-inference: Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.

Thanks.

AastaLLL · January 8, 2018, 3:30am

Here is an update on the TensorFlow Object Detection API:

From this comment:
Very slow Postprocessing in Object Detection API · Issue #2710 · tensorflow/models · GitHub
Some layers in the Object Detection API still run in CPU mode, which explains why the performance is not good on Jetson.

Thanks.

gustavvz · January 10, 2018, 8:16am

This is interesting, thank you AastaLLL for investigating.
As I understand it, though, this can’t be the only reason, because I updated the tf.Session() config in my code to allow GPU memory growth.

While the performance stays the same, the model now uses only around 300 MB of RAM, and GPU and CPU usage are at the same level as before.

This is what puzzles me: neither the GPU memory, nor the GPU frequency, nor the CPU is maxed out at any time.

So where is the bottleneck? Why doesn’t the Jetson just use more of its power?

Any ideas on that?

D_pz · January 11, 2018, 2:37am

just saw your reply… I will try it, thanks a lot

AastaLLL · January 19, 2018, 3:46am

Hi,
We found the performance issue comes from a TensorFlow operation called tf.where.

This is a control flow operation and has poor performance on GPU.

We are checking whether there is a workaround for this issue.
We will update you once we have more information.

Thanks.

AastaLLL · January 29, 2018, 3:52am

Here are some updates:

Performance improves if the CNN is placed on the GPU and MAP on the CPU.
It takes around 70ms on TX2 with maximized frequency.
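
To relate that figure to the frame rates quoted earlier in the thread, here is a small conversion sketch. It assumes the 70 ms is the time for one frame and that frames are processed one after another; the function names are illustrative only.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second if each frame takes latency_ms and frames run back to back."""
    return 1000.0 / latency_ms


def latency_ms_from_fps(fps):
    """Time budget per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


print("70 ms per frame -> %.1f FPS" % fps_from_latency_ms(70))
print("4 FPS -> %.0f ms per frame" % latency_ms_from_fps(4))
```

Under those assumptions, 70 ms corresponds to roughly 14 FPS, compared with the 250 ms per frame implied by the 4 FPS reported at the start of the thread. Camera capture and drawing the boxes are not included here and would lower the end-to-end rate.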

Thanks.

jesp.hc · January 29, 2018, 7:55am

That sounds good, AastaLLL. Could you provide any details on how this is achieved, i.e. how to put the MAP on the CPU instead of the GPU?

gustavvz · January 29, 2018, 10:30am

It would be nice if you could share how to achieve this, @AastaLLL!

AastaLLL · January 31, 2018, 7:22am

Hi,

We are preparing the script to share with you.
In short, we modify the .pb parser and create two networks: one for GPU and the other for CPU.
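
Until the script is available, the idea of cutting one graph into a GPU part and a CPU part can be illustrated with plain Python. This is only a sketch of the partitioning step, not the actual script and not TensorFlow code: the toy graph, the choice of cut nodes and the function name are all hypothetical. Every node upstream of the cut goes into the GPU part and everything else into the CPU part.

```python
def split_graph(node_inputs, cut_nodes):
    """Return (gpu_nodes, cpu_nodes).

    node_inputs maps each node name to the names of its input nodes.
    The cut nodes and everything upstream of them form the GPU part;
    all remaining nodes form the CPU part.
    """
    gpu = set()
    stack = list(cut_nodes)
    while stack:
        name = stack.pop()
        if name in gpu:
            continue
        gpu.add(name)
        stack.extend(node_inputs.get(name, []))
    cpu = set(node_inputs) - gpu
    return gpu, cpu


# Hypothetical toy graph standing in for a frozen detection graph.
TOY_GRAPH = {
    "image_tensor": [],
    "conv1": ["image_tensor"],
    "conv2": ["conv1"],
    "box_predictor": ["conv2"],
    "class_predictor": ["conv2"],
    "where": ["class_predictor"],
    "nms": ["box_predictor", "where"],
    "detection_boxes": ["nms"],
}

gpu_nodes, cpu_nodes = split_graph(TOY_GRAPH, ["box_predictor", "class_predictor"])
print("GPU part:", sorted(gpu_nodes))
print("CPU part:", sorted(cpu_nodes))
```

In a real graph the cut nodes would be the last tensors produced by the CNN, and their values would be fed as inputs to the second network.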

Thanks.

rsalem · February 2, 2018, 1:26am

Hi AastaLLL, can you please let us know if the script is ready?

AastaLLL · February 2, 2018, 6:11am

Check this issue:

github.com/tensorflow/models

Slow inference speed of object detection models and a hack as solution

opened 06:33AM - 30 Jan 18 UTC

closed 06:42PM - 07 Feb 20 UTC

wkelongws

### System information
- **What is the top-level directory of the model you are using**: models/research/object_detection/
- **Have I written custom code**: No custom code for reproducing the bug. I have written custom code for diagnosing.
- **OS Platform and Distribution**: Linux Ubuntu 16.04
- **TensorFlow installed from (source or binary)**: Anaconda conda-forge channel
- **TensorFlow version**: b'unknown' 1.4.1 (output from `python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"`)
- **CUDA/cuDNN version**: CUDA 8.0/cuDNN 6.0
- **GPU model and memory**: 1 TITAN X (Pascal) 12189MiB
- **Exact command to reproduce**:

Run the provided [object detection demo](https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb) (ssd_mobilenet_v1_coco_2017_11_17 model) with a small modification in the last cell to record the inference speed:

```
i = 0
for _ in range(10):
    image_path = TEST_IMAGE_PATHS[1]
    i += 1
    image = Image.open(image_path)
    # the array based representation of the image will be used later in order to prepare the
    # result image with boxes and labels on it.
    image_np = load_image_into_numpy_array(image)
    # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
    image_np_expanded = np.expand_dims(image_np, axis=0)
    # Actual detection.
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    start_time = time.time()
    (boxes, scores, classes, num) = sess.run(
        [detection_boxes, detection_scores, detection_classes, num_detections],
        feed_dict={image_tensor: image_np_expanded})
    print('Iteration %d: %.3f sec'%(i, time.time()-start_time))
```

The results show that the inference speed is much slower than the reported inference speed, 30ms, in the [model zoo page](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md):

```
Iteration 1: 2.212 sec
Iteration 2: 0.069 sec
Iteration 3: 0.076 sec
Iteration 4: 0.068 sec
Iteration 5: 0.072 sec
Iteration 6: 0.072 sec
Iteration 7: 0.071 sec
Iteration 8: 0.079 sec
Iteration 9: 0.085 sec
Iteration 10: 0.071 sec
```

### Describe the problem
**Summary:**
By directly running the provided [object detection demo](https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb), the observed inference speed of object detection models in the model zoo is much slower than the reported inference speed. With some hack, a higher inference speed than the reported speed can be achieved. After some diagnostics, it is highly likely that the slow inference speed is caused by:
* **tf.where and other post-processing operations are running anomalously slow on GPU; or**
* **The frozen inference graph lacks the ability to optimize the GPU/CPU assignment.**

**proof of the hypothesis: tf.where and other post-processing operations are running anomalously slow on GPU**

By outputting a trace file, we can diagnose the running time of each node in detail. To output the trace file, modify the last cell of [object detection demo](https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb) as:

```
from tensorflow.python.client import timeline

with detection_graph.as_default():
  with tf.Session(graph=detection_graph) as sess:
    # Definite input and output Tensors for detection_graph
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    # Each box represents a part of the image where a particular object was detected.
    detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
    # Each score represent how level of confidence for each of the objects.
    # Score is shown on the result image, together with the class label.
    detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
    detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
    num_detections = detection_graph.get_tensor_by_name('num_detections:0')
    i = 0
    for _ in range(10):
      image_path = TEST_IMAGE_PATHS[1]
      i += 1
      image = Image.open(image_path)
      # the array based representation of the image will be used later in order to prepare the
      # result image with boxes and labels on it.
      image_np = load_image_into_numpy_array(image)
      # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
      image_np_expanded = np.expand_dims(image_np, axis=0)
      # Actual detection.
      options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
      run_metadata = tf.RunMetadata()
      start_time = time.time()
      (boxes, scores, classes, num) = sess.run(\
          [detection_boxes, detection_scores, detection_classes, num_detections], \
          feed_dict={image_tensor: image_np_expanded}, \
          options=options, run_metadata=run_metadata)
      print('Iteration %d: %.3f sec'%(i, time.time()-start_time))
      # Visualization of the results of a detection.
      vis_util.visualize_boxes_and_labels_on_image_array(
          image_np,
          np.squeeze(boxes),
          np.squeeze(classes).astype(np.int32),
          np.squeeze(scores),
          category_index,
          use_normalized_coordinates=True,
          line_thickness=8)
      plt.figure(figsize=IMAGE_SIZE)
      plt.imshow(image_np)
      fetched_timeline = timeline.Timeline(run_metadata.step_stats)
      chrome_trace = fetched_timeline.generate_chrome_trace_format()
      with open('Experiment_1.json' , 'w') as f:
        f.write(chrome_trace)
```

The output json file has been included in the .zip file in the **source code** section below. Visualizing the json file in chrome://tracing/ gives:

![experiment1](https://user-images.githubusercontent.com/14045078/35551422-dae50440-0543-11e8-896f-62bc33bcf0af.png)

The CNN related operations end at ~13ms and the rest of the post-processing operations take about 133ms. We have noticed that adding the trace function further slows down the inference speed. But it shows clearly that the post-processing operations (post CNN) run very slowly on GPU.

As a comparison, one can run the [object detection demo](https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb) with GPU disabled, and profile the running trace using the same method. To disable GPU, add `os.environ['CUDA_VISIBLE_DEVICES'] = ''` in the first row of the last cell.

The output json file has been included in the .zip file in the **source code** section below. Visualizing this json file in chrome://tracing/ gives:

![experiment_2](https://user-images.githubusercontent.com/14045078/35581507-d329b978-05a0-11e8-808e-dc78232e284d.png)

By running everything on CPU, the CNN operations end at roughly 63ms and the rest of the post-processing operations take only about 15ms on CPU, which is significantly faster than the time they take when running on GPU.

**proof of the hypothesis: The frozen inference graph lacks the ability to optimize the GPU/CPU assignment**

We added a hack to see whether we can achieve a higher inference speed. The hack is manually assigning the CNN related nodes to the GPU and the rest of the nodes to the CPU. The idea is to use the GPU to accelerate only the CNN operations and leave the post-processing operations on the CPU. The source code has been included in the .zip file in the **source code** section below.

With this hack, we are able to observe a higher inference speed than the reported speed.

```
Iteration 1: 1.021 sec
Iteration 2: 0.027 sec
Iteration 3: 0.026 sec
Iteration 4: 0.027 sec
Iteration 5: 0.026 sec
Iteration 6: 0.026 sec
Iteration 7: 0.026 sec
Iteration 8: 0.031 sec
Iteration 9: 0.031 sec
Iteration 10: 0.026 sec
```

**To verify the hypothesis, here are some questions we need answered by the tensorflow team:**
1. Are the numbers of inference speed reported on the detection model zoo page tested on the frozen inference graphs or original graphs?
2. Are the slow tf.where and other post-processing operations supposed to run on GPU or CPU? Is the slow running speed on GPU normal?
3. Is there a device assigning function to optimize the GPU/CPU use in the original tensorflow graphs? Is that function missing in the frozen inference graphs?

### Source code / logs
[tensorflowissue.zip](https://github.com/tensorflow/models/files/1678729/tensorflowissue.zip)
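
The two timing lists in the issue above can be compared with a few lines of Python. This sketch drops the first iteration of each run as warm-up (session start-up dominates it) and averages the rest; the variable and function names are made up for illustration. Note that these timings come from the issue author's TITAN X (Pascal), not from a TX2.

```python
from statistics import mean

# Per-iteration times in seconds, copied from the issue quoted above.
STOCK_GRAPH = [2.212, 0.069, 0.076, 0.068, 0.072, 0.072, 0.071, 0.079, 0.085, 0.071]
SPLIT_GRAPH = [1.021, 0.027, 0.026, 0.027, 0.026, 0.026, 0.026, 0.031, 0.031, 0.026]


def steady_state(times, warmup=1):
    """Mean time per frame after dropping the first `warmup` iterations."""
    return mean(times[warmup:])


stock = steady_state(STOCK_GRAPH)
split = steady_state(SPLIT_GRAPH)
for label, seconds in (("stock graph", stock), ("GPU/CPU split", split)):
    print("%-14s %.1f ms per frame, %.1f FPS" % (label, seconds * 1000, 1 / seconds))
print("speedup: %.2fx" % (stock / split))
```

On these numbers the split graph averages about 27 ms per frame against about 74 ms for the stock graph, a speedup of roughly 2.7x.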

Thanks.

rsalem · February 8, 2018, 1:12am

Thanks, @AastaLLL, that worked out!

md.mizbauddin · March 29, 2018, 6:36pm

Hi everyone, does anyone know how to increase YOLO FPS on the TX2? When I ran YOLO v2 on my laptop I achieved about 25 FPS, but on my TX2 I only get 6-7 FPS. Can anyone explain why there is such a big difference?

snarky · April 1, 2018, 8:11pm

What is the GPU frequency and number of CUDA cores and architecture of the GPU in your laptop?
How does that compare to the TX2 specs?
Also note that the TX2 is aimed at 12 Watts total across CPU + GPU (give or take), which is probably much less than your laptop is using.
