TensorFlow object detection and image classification accelerated for NVIDIA Jetson

We’re happy to share the following project on GitHub which demonstrates object detection and image classification workflows using TensorRT integration in TensorFlow (for details on TF-TRT integration see this blog post). With this project you can easily accelerate popular models like SSD Inception V2 for use on Jetson.

The project is hosted at the following URL

https://github.com/NVIDIA-Jetson/tf_trt_models

By following the steps outlined in this project you will

  1. Download pretrained object detection and image classification models sourced from the TensorFlow models repository
  2. Run scripts to preprocess the TensorFlow graphs for best utilization of TensorRT and Jetson
  3. Accelerate models using TensorRT integration in TensorFlow
  4. Execute models with the TensorFlow Python API

The models are sourced from the TensorFlow models repository, so it is possible to train the models for custom tasks using the steps detailed there. Provided you use one of the listed model architectures, you can follow the steps above to easily accelerate the model for ideal performance on Jetson.

Enjoy!

Hi jaybdub,

It’s easy to use! Thank you!
I tried it on a PC, but I think I got the labeling wrong.
My webcam shows a car, but the label says bicycle.
I used ssd_inception_v2_coco_2017_11_17 with mscoco_label_map.pbtxt.

In mscoco_label_map.pbtxt:

item {
  name: "/m/0199g"
  id: 2
  display_name: "bicycle"
}
item {
  name: "/m/0k4j"
  id: 3
  display_name: "car"
}

So car is id 3, but bicycle is id 2.

I also looked at https://github.com/NVIDIA-Jetson/tf_trt_models/blob/master/examples/detection/detection.ipynb
where the dog’s id appears to be 17.

But dog is id 18 and cat is id 17.
In mscoco_label_map.pbtxt:

item {
  name: "/m/01yrx"
  id: 17
  display_name: "cat"
}
item {
  name: "/m/0bt9lr"
  id: 18
  display_name: "dog"
}

Does TensorRT need to adjust labels?

My repository is here: https://github.com/naisy/realtime_object_detection

Hi naisy,

It shouldn’t be related to TensorRT, but in this case it seems the neural network output is 0-indexed, while the label map is 1-indexed. You should be able to add +1 to each output index of the network before associating with the label map to get the correct label.

Hope this helps!

Hi jaybdub,

Thank you! It works well!

classes = np.add(classes, 1)
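The off-by-one fix discussed above can be sketched as follows. This is only an illustration: the helper names `parse_label_map` and `to_label_names` are hypothetical, the label-map text is limited to the two entries quoted earlier in the thread, and the sketch assumes (as suggested in the reply) that the network outputs 0-indexed class values while the label map is 1-indexed.

```python
import re

import numpy as np

# Two entries copied from the mscoco_label_map.pbtxt excerpt quoted above.
LABEL_MAP_TEXT = '''
item {
  name: "/m/0199g"
  id: 2
  display_name: "bicycle"
}
item {
  name: "/m/0k4j"
  id: 3
  display_name: "car"
}
'''


def parse_label_map(text):
    """Return {id: display_name} for label-map text in the format shown above."""
    pattern = re.compile(r'id:\s*(\d+)\s*display_name:\s*"([^"]+)"')
    return {int(item_id): name for item_id, name in pattern.findall(text)}


def to_label_names(raw_classes, label_map, offset=1):
    """Shift raw class outputs by `offset`, then look them up in the label map."""
    ids = np.asarray(raw_classes).astype(int) + offset
    return [label_map.get(int(i), "unknown") for i in ids]


label_map = parse_label_map(LABEL_MAP_TEXT)
# A raw output of 2 becomes id 3 ("car"); a raw output of 1 becomes id 2 ("bicycle").
names = to_label_names(np.array([2.0, 1.0]), label_map)
print(names)
```

Keeping the offset as a parameter makes it easy to set it to 0 for a model whose outputs already match the label map ids.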

Thank you for the clear explanation and benchmarking on this website, and for testing out different models. It is really appreciated!

According to your execution time table, I should get 54.4ms when running ssd_inception_v2_coco on the TX2. Over 200 runs, after the network is ‘warmed up’, I get 69.63ms, which seems a significant difference to me. Looking at tegrastats, the GPU does not seem to be utilized very efficiently (utilization varies over time, but it is rarely even close to 90%):

RAM 4167/7854MB (lfb 84x4MB) CPU [49%@2035,0%@2035,0%@2035,44%@2035,47%@2032,38%@2035] EMC_FREQ 7%@1866 GR3D_FREQ 18%@1300 APE 150 MTS fg 0% bg 0% BCPU@48C MCPU@48C GPU@47.5C PLL@48C Tboard@41C Tdiode@46.25C PMIC@100C thermal@48.3C VDD_IN 7862/4839 VDD_CPU 1763/820 VDD_GPU 2531/947 VDD_SOC 997/929 VDD_WIFI 0/33 VDD_DDR 1626/1271

I just followed all the steps in the GitHub readme and the notebook, so any idea what could be the cause of this? I use JetPack 3.3 and TensorFlow 1.10.

Thanks for the feedback and raising this issue!

We collected the benchmark timings under the following configuration:

(1) JetPack 3.2
(2) TensorFlow 1.8
(3) MAXN power mode (sudo nvpmodel -m0)
(4) Jetson clocks enabled (sudo ~/jetson_clocks.sh)
(5) Runtime averaged over 50 calls to sess.run(…) on a static image. This excludes reading from disk and JPEG decoding.

First, if any of (3)-(5) differed from our configuration when you profiled, that would cause a difference in the timing.

If they are consistent with our profiling, then perhaps it is a performance regression from JetPack 3.2 -> 3.3, or TensorFlow 1.8 -> 1.10, which we would want to investigate.
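The measurement described in (5) can be sketched roughly as below. This is a minimal sketch, not the script used for the published numbers: `benchmark` and `fake_inference` are hypothetical names, and `fake_inference` is a stand-in workload that would be replaced by a call to sess.run(…) on an already decoded, static image.

```python
import statistics
import time


def benchmark(fn, warmup=1, runs=50):
    """Call fn() `warmup` times untimed, then time `runs` calls.

    Returns (mean, standard deviation) of the per-call time in milliseconds.
    """
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)


def fake_inference():
    # Stand-in workload (assumption); swap in the real sess.run(...) call.
    sum(i * i for i in range(10000))


mean_ms, stdev_ms = benchmark(fake_inference, warmup=1, runs=50)
print("mean %.3f ms, stdev %.3f ms" % (mean_ms, stdev_ms))
```

Reporting the standard deviation next to the mean helps tell a consistently slower setup apart from one with occasional slow calls.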

Thanks for the response! I have indeed run the nvpmodel -m0 command and jetson_clocks.sh, so (3) and (4) are the same. And just so there is no doubt about it, here is the code I used to make sure (5) is comparable:

import numpy as np
from time import time

# Warm-up run, excluded from the timing
scores, boxes, classes = tf_sess.run([tf_scores, tf_boxes, tf_classes], feed_dict={tf_input: image_resized[None, ...]})

times = []
for i in range(200):
    t0 = time()
    scores, boxes, classes = tf_sess.run([tf_scores, tf_boxes, tf_classes], feed_dict={tf_input: image_resized[None, ...]})
    times.append(time() - t0)
print(np.mean(times))  # mean time per call, in seconds

So I would say my setup is comparable. Two other things I noticed:

  • Running graphdef.ParseFromString() on the frozen graph (generated with build_detection_graph) takes 4.7 seconds. Loading the trt_graph generated by trt.create_inference_graph takes 9 minutes and 26 seconds (!). The same goes for running tf.import_graph_def(graphdef, name='') on both files: 12.9 seconds for the frozen graph, 41.8 seconds for the trt_graph. Is this anywhere near the expected times? It seems ridiculously long to me and could be indicative of something not working right with these versions of JetPack and TensorFlow.
  • tegrastats reports near-constant 90-100% GPU usage when running the frozen graph (which runs at a speed comparable to what you reported: 139ms vs. your reported 132ms for ssd_inception_v2_coco 300x300)

I’ll see if I can get JetPack 3.2 with TensorFlow 1.8 installed and try to reproduce your speeds that way, to make sure nothing else is going wrong.

JetPack 3.2 with TensorFlow 1.8 is a little faster, but still not as fast as reported (note that this is a different TX2 module). With the same setup as before, I now get an average runtime of 64.72ms. That is 5ms quicker than with TensorFlow 1.10 and JetPack 3.3, but still 10ms short of your measured time. Is there something I’m still missing here?

Running ParseFromString to load the trt_graph now only takes 4.59 seconds, so that bug is gone at least.

The GPU usage still seems to be suboptimal, but maybe that is inherent in the model / working with a batch of 1. This is what tegrastats reports with --interval 100:

RAM 2880/7854MB (lfb 719x4MB) CPU [60%@2035,0%@2034,0%@2034,33%@2029,30%@2034,66%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 66%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@46.9C VDD_IN 8092/6901 VDD_CPU 1686/1360 VDD_GPU 2606/1968 VDD_SOC 996/956 VDD_WIFI 0/20 VDD_DDR 1640/1498
RAM 2880/7854MB (lfb 719x4MB) CPU [25%@2034,0%@2033,0%@2035,30%@2032,50%@2033,60%@2035] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@47.75C PMIC@100C thermal@46.9C VDD_IN 8015/6905 VDD_CPU 1686/1361 VDD_GPU 2530/1970 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1640/1498
RAM 2880/7854MB (lfb 719x4MB) CPU [50%@2031,0%@2035,0%@2035,55%@2031,60%@2038,58%@2035] EMC_FREQ 8%@1866 GR3D_FREQ 2%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@47.75C PMIC@100C thermal@46.9C VDD_IN 8015/6909 VDD_CPU 1763/1363 VDD_GPU 2530/1972 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1621/1499
RAM 2880/7854MB (lfb 719x4MB) CPU [30%@2031,0%@2035,0%@2035,45%@2033,27%@2034,55%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 91%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@47C PMIC@100C thermal@46.9C VDD_IN 8130/6913 VDD_CPU 1686/1364 VDD_GPU 2683/1974 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1640/1499
RAM 2880/7854MB (lfb 719x4MB) CPU [66%@2035,0%@2035,0%@2033,50%@2034,44%@2033,55%@2035] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@47C PMIC@100C thermal@46.9C VDD_IN 7900/6916 VDD_CPU 1686/1365 VDD_GPU 2454/1976 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1621/1500
RAM 2880/7854MB (lfb 719x4MB) CPU [40%@2035,0%@2034,0%@2035,54%@2035,36%@2034,60%@2032] EMC_FREQ 8%@1866 GR3D_FREQ 99%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47.5C PLL@47C Tboard@42C Tdiode@47C PMIC@100C thermal@46.9C VDD_IN 8053/6920 VDD_CPU 1686/1366 VDD_GPU 2606/1978 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1640/1500
RAM 2880/7854MB (lfb 719x4MB) CPU [58%@2035,0%@2035,0%@2035,44%@2035,55%@2034,30%@2035] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@46.5C PMIC@100C thermal@46.9C VDD_IN 7938/6924 VDD_CPU 1686/1367 VDD_GPU 2454/1980 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1621/1500
RAM 2880/7854MB (lfb 719x4MB) CPU [40%@2032,0%@2035,0%@2036,55%@2032,33%@2033,54%@2036] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47.5C PLL@47C Tboard@42C Tdiode@46.5C PMIC@100C thermal@46.9C VDD_IN 7977/6928 VDD_CPU 1686/1368 VDD_GPU 2531/1982 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1621/1501
RAM 2880/7854MB (lfb 719x4MB) CPU [33%@2033,0%@2035,0%@2035,33%@2033,63%@2033,45%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 80%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@47.2C VDD_IN 8168/6932 VDD_CPU 1686/1369 VDD_GPU 2683/1984 VDD_SOC 996/958 VDD_WIFI 0/20 VDD_DDR 1659/1501
RAM 2880/7854MB (lfb 719x4MB) CPU [70%@2033,0%@2035,0%@2035,54%@2032,40%@2034,54%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47.5C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@47.2C VDD_IN 7900/6935 VDD_CPU 1686/1371 VDD_GPU 2454/1986 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1621/1502
RAM 2880/7854MB (lfb 719x4MB) CPU [36%@2033,0%@2035,0%@2033,63%@2031,22%@2032,40%@2031] EMC_FREQ 8%@1866 GR3D_FREQ 69%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@47.2C VDD_IN 8053/6939 VDD_CPU 1686/1372 VDD_GPU 2606/1988 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1640/1502
RAM 2880/7854MB (lfb 719x4MB) CPU [70%@2032,0%@2035,0%@2035,40%@2026,40%@2032,54%@2032] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@47.2C VDD_IN 7823/6942 VDD_CPU 1686/1373 VDD_GPU 2377/1989 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1602/1503
RAM 2880/7854MB (lfb 719x4MB) CPU [30%@2031,0%@2034,0%@2036,50%@2030,50%@2033,33%@2031] EMC_FREQ 8%@1866 GR3D_FREQ 71%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@47.2C VDD_IN 7977/6945 VDD_CPU 1686/1374 VDD_GPU 2531/1991 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1621/1503
RAM 2880/7854MB (lfb 719x4MB) CPU [55%@2029,0%@2035,0%@2035,44%@2034,40%@2031,54%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 19%@1300 APE 150 MTS fg 0% bg 0% BCPU@47.5C MCPU@47.5C GPU@46C PLL@47.5C Tboard@42C Tdiode@47.75C PMIC@100C thermal@47.2C VDD_IN 8092/6949 VDD_CPU 1686/1375 VDD_GPU 2606/1993 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1640/1504
RAM 2880/7854MB (lfb 719x4MB) CPU [55%@2033,0%@2034,0%@2034,60%@2034,55%@2034,50%@2032] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@47.75C PMIC@100C thermal@47.2C VDD_IN 7977/6953 VDD_CPU 1763/1376 VDD_GPU 2454/1995 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1621/1504
RAM 2880/7854MB (lfb 719x4MB) CPU [50%@2035,0%@2036,0%@2036,50%@2031,40%@2034,45%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 14%@1300 APE 150 MTS fg 0% bg 0% BCPU@47.5C MCPU@47.5C GPU@47C PLL@47.5C Tboard@42C Tdiode@48C PMIC@100C thermal@47.2C VDD_IN 8092/6957 VDD_CPU 1686/1377 VDD_GPU 2606/1997 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1640/1504
RAM 2880/7854MB (lfb 719x4MB) CPU [58%@2033,0%@2036,0%@2034,37%@2036,44%@2037,44%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 7%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@48C PMIC@100C thermal@47.2C VDD_IN 8015/6960 VDD_CPU 1763/1378 VDD_GPU 2530/1999 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1640/1505
RAM 2880/7854MB (lfb 719x4MB) CPU [33%@2033,0%@2035,0%@2035,50%@2032,40%@2035,62%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@48C PMIC@100C thermal@47.2C VDD_IN 7977/6964 VDD_CPU 1686/1380 VDD_GPU 2531/2000 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1640/1505
RAM 2880/7854MB (lfb 719x4MB) CPU [40%@2025,0%@2035,0%@2035,40%@2034,22%@2034,45%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 2%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@47C PMIC@100C thermal@46.8C VDD_IN 8092/6967 VDD_CPU 1686/1381 VDD_GPU 2606/2002 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1640/1506
RAM 2880/7854MB (lfb 719x4MB) CPU [66%@2032,0%@2034,0%@2035,66%@2033,44%@2033,40%@2036] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@47C PMIC@100C thermal@46.8C VDD_IN 7938/6971 VDD_CPU 1686/1382 VDD_GPU 2454/2004 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1602/1506
RAM 2880/7854MB (lfb 719x4MB) CPU [20%@2032,0%@2035,0%@2035,50%@2032,36%@2032,44%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 98%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@46.8C VDD_IN 8053/6974 VDD_CPU 1686/1383 VDD_GPU 2606/2006 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1640/1507
RAM 2880/7854MB (lfb 719x4MB) CPU [80%@2033,0%@2035,0%@2035,54%@2033,55%@2034,36%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@46.8C VDD_IN 7823/6977 VDD_CPU 1686/1384 VDD_GPU 2377/2007 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1602/1507
RAM 2880/7854MB (lfb 719x4MB) CPU [50%@2031,0%@2035,0%@2035,30%@2035,60%@2033,36%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 99%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@46.8C VDD_IN 8053/6981 VDD_CPU 1686/1385 VDD_GPU 2531/2009 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1621/1507
RAM 2880/7854MB (lfb 719x4MB) CPU [33%@2035,0%@2035,0%@2035,70%@2034,44%@2034,40%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 19%@1300 APE 150 MTS fg 0% bg 0% BCPU@47.5C MCPU@47.5C GPU@46.5C PLL@47.5C Tboard@42C Tdiode@47.5C PMIC@100C thermal@46.8C VDD_IN 7977/6984 VDD_CPU 1686/1386 VDD_GPU 2454/2010 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1621/1508
RAM 2880/7854MB (lfb 719x4MB) CPU [36%@2033,0%@2034,0%@2034,45%@2032,60%@2034,54%@2036] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@46.8C VDD_IN 7977/6987 VDD_CPU 1686/1387 VDD_GPU 2454/2012 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1621/1508
RAM 2880/7854MB (lfb 719x4MB) CPU [66%@2028,0%@2034,0%@2035,40%@2034,36%@2035,44%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 25%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@46.8C VDD_IN 8092/6991 VDD_CPU 1686/1387 VDD_GPU 2606/2014 VDD_SOC 996/960 VDD_WIFI 0/18 VDD_DDR 1640/1508
RAM 2880/7854MB (lfb 719x4MB) CPU [40%@2034,0%@2036,0%@2036,66%@2033,37%@2034,44%@2036] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@46.8C VDD_IN 7938/6994 VDD_CPU 1686/1388 VDD_GPU 2454/2015 VDD_SOC 996/960 VDD_WIFI 0/18 VDD_DDR 1621/1509
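To put a number on the utilization in a log like the one above, the GR3D_FREQ field can be pulled out of each line and averaged. The sketch below is an illustration only: it assumes, as this post does, that the percentage in GR3D_FREQ reflects GPU load, the function name `gpu_utilization` is hypothetical, and the sample lines are shortened excerpts of the first five lines above.

```python
import re

# Shortened excerpts of the first five tegrastats lines above
# (only the fields around GR3D_FREQ are kept).
SAMPLE_LINES = [
    "EMC_FREQ 8%@1866 GR3D_FREQ 66%@1300 APE 150",
    "EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150",
    "EMC_FREQ 8%@1866 GR3D_FREQ 2%@1300 APE 150",
    "EMC_FREQ 8%@1866 GR3D_FREQ 91%@1300 APE 150",
    "EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150",
]

GR3D_PATTERN = re.compile(r"GR3D_FREQ (\d+)%")


def gpu_utilization(lines):
    """Return the GR3D_FREQ percentage from every line that contains one."""
    values = []
    for line in lines:
        match = GR3D_PATTERN.search(line)
        if match:
            values.append(int(match.group(1)))
    return values


values = gpu_utilization(SAMPLE_LINES)
mean_util = sum(values) / len(values)
idle_share = values.count(0) / len(values)
print("samples:", values)
print("mean utilization: %.1f%%, share of idle samples: %.0f%%"
      % (mean_util, idle_share * 100))
```

With a 100ms sampling interval and inference calls of roughly 65ms, individual samples can land between kernels, so the mean over many lines is more informative than any single reading.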

Hi,

We have released an official TensorFlow package.
Could you repeat your experiment with our official package and share the result with us?
https://devtalk.nvidia.com/default/topic/1038957/jetson-tx2/tensorflow-for-jetson-tx2-/

Thanks.

Sure! On the JetPack 3.2 setup, now with the official TensorFlow 1.9, I get about the same running time I got earlier with the JetPack 3.3 and TensorFlow 1.10 setup: an average of 69.65ms per image for ssd_inception_v2. I realize I should maybe have mentioned this last time, but this is the log for the creation of the inference graph (with the official TF 1.9):

>>> trt_graph = trt.create_inference_graph(
...     input_graph_def=frozen_graph,
...     outputs=output_names,
...     max_batch_size=1,
...     max_workspace_size_bytes=1 << 25,
...     precision_mode='FP16',
...     minimum_segment_size=50
... )
2018-09-10 09:01:07.543057: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2018-09-10 09:01:24.474770: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:438] MULTIPLE tensorrt candidate conversion: 7
2018-09-10 09:01:25.042883: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.043030: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:0 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 91 nodes)
2018-09-10 09:01:25.048693: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.048845: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:1 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 812 nodes)
2018-09-10 09:01:25.883711: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:2 due to: "Invalid argument: Output node 'FeatureExtractor/InceptionV2/InceptionV2/Mixed_3b/concat-4-LayoutOptimizer' is weights not tensor" SKIPPING......( 844 nodes)
2018-09-10 09:01:25.890138: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:3 due to: "Unimplemented: Operation: GatherV2 does not support tensor input as indices, at: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/FilterGreaterThan_83/Gather/GatherV2" SKIPPING......( 91 nodes)
2018-09-10 09:01:25.894630: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.894759: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:4 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 180 nodes)
2018-09-10 09:01:25.898671: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.898789: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:5 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 93 nodes)
2018-09-10 09:01:25.902308: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.902452: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:6 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 91 nodes)
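A log like the one above can be summarized by counting how many candidate subgraphs were skipped and for what reason. The sketch below is only an illustration: the function name `summarize_skips` is hypothetical, and the excerpt contains three of the seven warning lines above with the timestamps dropped.

```python
import re
from collections import Counter

# Three of the warning lines above, with the timestamps removed.
LOG_EXCERPT = '''\
W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:0 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 91 nodes)
W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:1 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 812 nodes)
W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:3 due to: "Unimplemented: Operation: GatherV2 does not support tensor input as indices, at: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/FilterGreaterThan_83/Gather/GatherV2" SKIPPING......( 91 nodes)
'''

SKIP_PATTERN = re.compile(
    r'subgraph_index:(\d+) due to: "([^"]*)" SKIPPING\.+\(\s*(\d+) nodes\)'
)


def summarize_skips(log_text):
    """Return a list of (subgraph_index, reason, node_count) for skipped subgraphs."""
    return [
        (int(index), reason, int(nodes))
        for index, reason, nodes in SKIP_PATTERN.findall(log_text)
    ]


skipped = summarize_skips(LOG_EXCERPT)
total_nodes = sum(nodes for _, _, nodes in skipped)
by_reason = Counter(reason for _, reason, _ in skipped)

print("skipped subgraphs:", len(skipped), "covering", total_nodes, "nodes")
for reason, count in by_reason.most_common():
    print(count, "x", reason[:60])
```

Run over the full log, a summary like this shows at a glance whether any of the candidate subgraphs were actually converted.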

Do you also want me to check the log or performance with JetPack 3.3 or with TF 1.10/1.8?

Hi,

Some instructions run slowly in TensorFlow because of resource transfers between the CPU and GPU.

Could you test it with nvprof and share the profiling data with us?
This step can help us find the bottleneck.

Thanks.

I’m seeing the same error as reported by frederiki3k63 when trying to optimize object detection models with trt.create_inference_graph(). I’m also using JetPack 3.3 with the latest official TensorFlow (1.9.0) wheel for TX2, as specified in https://devtalk.nvidia.com/default/topic/1038957/jetson-tx2/tensorflow-for-jetson-tx2-/

https://developer.download.nvidia.com/compute/redist/jp33/tensorflow-gpu/tensorflow_gpu-1.9.0+nv18.8-cp35-cp35m-linux_aarch64.whl

Due to this error, I think the object detection model just runs at un-optimized speed.

......
2018-09-13 16:00:44.289961: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
......

@AastaLLL, could you check and advise? Thanks.

Right. Here are the files and here’s the log:

nvidia@tegra-ubuntu:~/tf_trt_models$ nvprof python3 test_inception.py
Starting session..
==1986== NVPROF is profiling process 1986, command: python3 test_inception.py
==1986== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
2018-09-13 13:31:04.443654: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:864] ARM64 does not support NUMA - returning NUMA node zero
2018-09-13 13:31:04.443902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 1.32GiB
2018-09-13 13:31:04.443957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-09-13 13:31:07.226353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-13 13:31:07.226440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0 
2018-09-13 13:31:07.226469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N 
2018-09-13 13:31:07.226696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 557 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Loading graph..
Running network one time to warm up..
Starting test..
Average run time: 0.08413341760635376
==1986== Profiling application: python3 test_inception.py
==1986== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   23.06%  935.20ms      3015  310.18us  70.400us  1.4286ms  trtwell_fp16x2_hcudnn_winograd_fp16x2_128x128_ldg1_ldg4_relu_tile148m_nt
                   12.01%  487.14ms      1809  269.29us  95.584us  1.2186ms  trt_maxwell_fp16x2_hcudnn_fp16x2_128x64_relu_small_nn_v1
                    7.24%  293.60ms     15477  18.969us  1.7600us  274.34us  void cuScale::scale<__half, __half, cuScale::Mode, bool=0, int=4, cuScale::FusedActivationType>(__half const *, cuScale::scale<__half, __half, cuScale::Mode, bool=0, int=4, cuScale::FusedActivationType>*, cuScale::KernelParameters<cuScale::scale<__half, __half, cuScale::Mode, bool=0, int=4, cuScale::FusedActivationType>>, nvinfer1::cudnn::reduced_divisor, nvinfer1::cudnn, nvinfer1::cudnn)
                    7.00%  283.71ms      2211  128.32us  50.816us  232.64us  trt_maxwell_fp16x2_hcudnn_fp16x2_128x64_relu_interior_nn_v1
                    6.32%  256.13ms      1608  159.29us  68.320us  239.46us  trtwell_fp16x2_hcudnn_fp16x2_128x64_relu_interior_nn
                    5.54%  224.52ms     15477  14.506us  3.3600us  166.24us  void cudnn::detail::activation_fw_4d_kernel<__half, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=1, bool=0>>(cudnnTensorStruct, __half const *, cudnn::detail::activation_fw_4d_kernel<__half, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=1, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
                    5.10%  206.76ms     15276  13.535us  1.1840us  203.90us  void cuEltwise::eltwise<cuEltwise::SimpleAlgo<__half, long>, cuEltwise::Compute<nvinfer1::ElementWiseOperation>>(cuEltwise::LaunchParams)
                    3.89%  157.86ms       603  261.79us  256.10us  270.88us  trtwell_scudnn_128x32_relu_interior_nn
                    2.82%  114.21ms       603  189.40us  93.280us  277.66us  trtwell_fp16x2_hcudnn_fp16x2_128x128_relu_interior_nn
                    2.65%  107.27ms       603  177.90us  90.144us  402.24us  trtwell_fp16x2_hcudnn_fp16x2_128x128_relu_small_nn
                    2.51%  101.93ms       804  126.78us  101.44us  144.26us  void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=1, int=4>, fused::KpqkPtrWriter<__half, int=2, int=2>, __half2, __half, int=2, int=5, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=4, int=2Type>, float)
                    2.48%  100.66ms       402  250.40us  193.76us  310.82us  trtwell_fp16x2_hcudnn_fp16x2_128x64_relu_small_nn
                    1.97%  79.866ms      1206  66.223us  2.1440us  246.46us  void cuPad::pad<__half, __half2, int=128, bool=1>(__half2*, int, cuPad::pad<__half, __half2, int=128, bool=1> const *, int, int, int, int, int, int, int, int, int, nvinfer1::cudnn::reduced_divisor, nvinfer1::cudnn, nvinfer1::cudnn, float const *, float const )
                    1.93%  78.395ms      1407  55.717us  20.736us  103.30us  void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=1, int=2>, fused::KpqkPtrWriter<__half, int=2, int=1>, __half2, __half, int=3, int=7, int=8, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=2, int=2Type>, float)
                    1.67%  67.523ms       402  167.97us  165.66us  172.16us  void nvinfer1::tiled_pooling::poolCHW_RS3_UV2_PQT_kernel<int=4, int=4, int=32, int=2, nvinfer1::ITiledPooling::PoolingMode>(nvinfer1::TiledPoolingParams, int)
                    1.36%  54.956ms       804  68.353us  64.800us  72.159us  void nvinfer1::tiled_pooling::poolCHW_RS3_UV1_PQT_kernel<int=4, int=4, int=32, int=2, nvinfer1::ITiledPooling::PoolingMode>(nvinfer1::TiledPoolingParams, int)
                    1.26%  50.947ms       201  253.47us  243.94us  279.30us  void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=1, int=1>, fused::KpqkPtrWriter<__half, int=1, int=1>, __half2, __half, int=5, int=7, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=1, int=2Type>, float)
                    1.25%  50.630ms       603  83.964us  55.520us  111.84us  void nvinfer1::tiled_pooling::poolCHW_RS3_UV1_PQT_kernel<int=4, int=8, int=16, int=2, nvinfer1::ITiledPooling::PoolingMode>(nvinfer1::TiledPoolingParams, int)
                    0.97%  39.172ms       201  194.89us  193.60us  197.06us  void tensorflow::_GLOBAL__N__60_tmpxft_00001137_00000000_6_resize_bilinear_op_gpu_cu_cpp1_ii_c6ae9512::ResizeBilinearKernel<float>(int, float const *, float, float, int, int, int, int, int, int, float*)
                    0.97%  39.139ms       201  194.72us  194.02us  195.90us  void cuInt8::nchwToNchhw2<float>(float const *, __half*, int, int, int, int, cuInt8::ReducedDivisorParameters)
                    0.86%  34.927ms       201  173.77us  171.90us  176.58us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=1024, int=1024, int=2, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=1024, int=1024, int=2, bool=0>*)
                    0.86%  34.674ms       852  40.697us     160ns  11.018ms  [CUDA memcpy HtoD]
                    0.68%  27.699ms       402  68.902us  66.560us  72.064us  void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=2, int=2>, fused::KpqkPtrWriter<__half, int=2, int=1>, __half2, __half, int=3, int=10, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=2, int=2Type>, float)
                    0.66%  26.603ms       201  132.35us  129.82us  135.36us  trt_maxwell_fp16x2_hcudnn_fp16x2_128x64_relu_large_nn_v1
                    0.51%  20.512ms       201  102.05us  99.136us  106.56us  trt_maxwell_fp16x2_hcudnn_fp16x2_128x128_relu_small_nn_v1
                    0.47%  18.992ms       201  94.489us  93.824us  95.680us  void cuPad::pad<float, float, int=128, bool=1>(float*, int, cuPad::pad<float, float, int=128, bool=1> const *, int, int, int, int, int, int, int, int, int, nvinfer1::cudnn::reduced_divisor, nvinfer1::cudnn, nvinfer1::cudnn, float const *, float const )
                    0.42%  17.199ms      2412  7.1300us     960ns  36.640us  void cuInt8::nchhw2ToNchw<float>(__half const *, float*, int, int, int, int, cuInt8::ReducedDivisorParameters)
                    0.39%  15.651ms       201  77.868us  75.744us  81.120us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=3, int=3, int=2, int=2, int=4, int=28, int=513, int=6, int=2, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams)
                    0.33%  13.584ms      1005  13.516us  2.5600us  34.400us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>*)
                    0.31%  12.657ms      3216  3.9350us     800ns  27.264us  [CUDA memcpy DtoD]
                    0.29%  11.641ms       201  57.914us  56.096us  60.064us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_difference_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
                    0.28%  11.268ms       201  56.060us  54.240us  58.144us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_left<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
                    0.25%  10.095ms       201  50.224us  49.344us  51.520us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=3, int=3, int=2, int=2, int=7, int=7, int=225, int=6, int=2, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams)
                    0.25%  10.076ms       201  50.130us  47.776us  52.640us  trt_maxwell_fp16x2_hcudnn_fp16x2_128x32_relu_interior_nn_v1
                    0.23%  9.3126ms       201  46.331us  43.584us  49.120us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=3, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<int, int=3> const , Eigen::DSizes<int, int=3> const , Eigen::TensorMap<Eigen::Tensor<float const , int=3, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=3)
                    0.19%  7.8614ms      1005  7.8220us     224ns  37.376us  [CUDA memcpy DtoH]
                    0.18%  7.4801ms       201  37.214us  35.520us  39.264us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_sigmoid_op<float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
                    0.17%  6.7136ms       201  33.401us  32.800us  34.080us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=3, int=3, int=1, int=1, int=14, int=14, int=256, int=6, int=2, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams)
                    0.10%  3.9531ms       201  19.667us  18.816us  20.576us  void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=2, int=2>, fused::KpqkPtrWriter<__half, int=2, int=1>, __half2, __half, int=8, int=2, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=2, int=2Type>, float)
                    0.07%  2.7159ms       201  13.511us  13.119us  14.336us  void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=4, int=1>, fused::KpqkPtrWriter<__half, int=1, int=1>, __half2, __half, int=2, int=7, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=1, int=2Type>, float)
                    0.06%  2.6248ms       201  13.058us  12.544us  13.600us  void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=2, int=2>, fused::KpqkPtrWriter<__half, int=2, int=1>, __half2, __half, int=4, int=4, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=2, int=2Type>, float)
                    0.06%  2.2885ms       201  11.385us  11.104us  12.320us  void cuEltwise::eltwise<cuEltwise::StripMineAlgo<__half, int>, cuEltwise::Compute<nvinfer1::ElementWiseOperation>>(cuEltwise::LaunchParams)
                    0.05%  2.1641ms       201  10.766us  10.400us  11.360us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=128, int=12, int=128, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=128, int=12, int=128, bool=0>*)
                    0.05%  2.0480ms       201  10.189us  7.5200us  16.160us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=512, int=4, int=512, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=512, int=4, int=512, bool=0>*)
                    0.04%  1.7993ms       201  8.9510us  8.5440us  9.4400us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=256, int=8, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=256, int=8, bool=0>*)
                    0.04%  1.6919ms       201  8.4170us  7.9040us  8.9600us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=512, int=512, int=4, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=512, int=512, int=4, bool=0>*)
                    0.04%  1.6261ms       804  2.0220us  1.5040us  3.2000us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
                    0.04%  1.6132ms       201  8.0260us  7.7440us  8.7040us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=256, int=4, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=256, int=4, bool=0>*)
                    0.04%  1.4325ms       804  1.7810us  1.1840us  3.4240us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
                    0.03%  1.1131ms       402  2.7680us  1.4400us  4.4160us  void tensorflow::functor::SwapDimension1And2InTensor3Simple<unsigned int, bool=0>(int, unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3Simple<unsigned int, bool=0>*)
                    0.03%  1.1125ms       804  1.3830us     960ns  2.4000us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, int=2> const , Eigen::DSizes<long, int=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=2)
                    0.02%  836.58us       402  2.0810us  1.3440us  3.1360us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_exp_op<float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
                    0.02%  727.14us       402  1.8080us  1.1840us  2.8800us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
                    0.02%  700.10us       402  1.7410us  1.1840us  2.5600us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
                    0.00%  55.712us        19  2.9320us     960ns  9.6640us  dit::computeOffsetsKernel(dit::ComputeOffsetsParams)
                    0.00%  5.6640us         9     629ns     160ns  1.5360us  [CUDA memset]
      API calls:   35.55%  3.88662s        19  204.56ms  1.4720us  2.78165s  cudaFree
                   35.48%  3.87971s        16  242.48ms  3.3920us  3.87934s  cudaStreamCreateWithFlags
                   16.83%  1.84008s     63535  28.961us  22.048us  735.01us  cudaLaunch
                    3.16%  345.50ms     11457  30.156us  22.272us  505.57us  cudaLaunchKernel
                    2.22%  242.86ms         1  242.86ms  242.86ms  242.86ms  cuDevicePrimaryCtxRetain
                    1.56%  170.91ms       813  210.22us  24.096us  650.46us  cuMemcpyHtoDAsync
                    1.45%  158.48ms      3251  48.748us  28.672us  12.520ms  cudaMemcpyAsync
                    1.13%  123.66ms    304735     405ns     287ns  569.50us  cudaSetupArgument
                    0.53%  57.682ms      1005  57.395us  21.504us  300.80us  cuMemcpyDtoHAsync
                    0.49%  53.481ms     29707  1.8000us  1.0560us  603.74us  cuEventQuery
                    0.48%  53.030ms      3636  14.584us  1.5680us  215.94us  cuEventRecord
                    0.35%  38.568ms     63535     607ns     384ns  521.41us  cudaConfigureCall
                    0.30%  32.768ms     67555     485ns     288ns  487.07us  cudaGetLastError
                    0.23%  24.655ms        54  456.57us  15.424us  14.183ms  cudaMalloc
                    0.06%  6.7991ms         5  1.3598ms  171.52us  5.3080ms  cuMemAlloc
                    0.06%  6.0980ms      1818  3.3540us  1.6960us  285.41us  cuStreamWaitEvent
                    0.04%  3.9055ms       201  19.430us  16.800us  46.559us  cuCtxSynchronize
                    0.02%  1.8859ms       201  9.3820us  7.9040us  24.831us  cudaEventRecord
                    0.01%  1.4191ms         2  709.55us  379.39us  1.0397ms  cuMemHostAlloc
                    0.01%  1.3228ms       424  3.1190us  1.7280us  29.280us  cudaEventCreateWithFlags
                    0.01%  549.86us        34  16.172us  8.0320us  28.672us  cudaStreamSynchronize
                    0.00%  510.59us         4  127.65us  51.200us  200.03us  cudaMemcpy
                    0.00%  460.51us         1  460.51us  460.51us  460.51us  cudaFreeHost
                    0.00%  445.25us         8  55.655us  14.144us  173.22us  cudaMemsetAsync
                    0.00%  397.18us       288  1.3790us     320ns  56.864us  cuDeviceGetAttribute
                    0.00%  376.32us         2  188.16us  164.96us  211.36us  cudaHostAlloc
                    0.00%  371.46us        15  24.763us  10.559us  76.288us  cudaGetDeviceProperties
                    0.00%  343.10us        17  20.182us  11.551us  43.712us  cudaCreateTextureObject
                    0.00%  221.82us         4  55.455us  30.720us  124.13us  cuStreamCreate
                    0.00%  199.94us         8  24.992us  3.8080us  52.192us  cudaStreamCreateWithPriority
                    0.00%  168.74us         1  168.74us  168.74us  168.74us  cuMemsetD32
                    0.00%  132.29us        44  3.0060us  1.8880us  23.552us  cudaEventDestroy
                    0.00%  97.184us        76  1.2780us     768ns  3.5520us  cudaDeviceGetAttribute
                    0.00%  68.223us        12  5.6850us  4.2240us  14.400us  cudaStreamDestroy
                    0.00%  65.088us         4  16.272us  5.4080us  31.712us  cuDeviceTotalMem
                    0.00%  60.448us        20  3.0220us     896ns  8.2240us  cudaGetDevice
                    0.00%  51.264us         2  25.632us  23.360us  27.904us  cuMemGetInfo
                    0.00%  41.984us         9  4.6640us  1.9520us  14.368us  cuCtxSetCurrent
                    0.00%  41.536us         1  41.536us  41.536us  41.536us  cuDeviceGetProperties
                    0.00%  37.056us         2  18.528us  15.168us  21.888us  cudaDeviceSynchronize
                    0.00%  34.111us         4  8.5270us  6.2400us  11.008us  cudaThreadSynchronize
                    0.00%  16.448us         6  2.7410us  1.3440us  5.4720us  cuEventCreate
                    0.00%  13.408us        17     788ns     576ns  1.8560us  cudaCreateChannelDesc
                    0.00%  13.376us         3  4.4580us     896ns  10.176us  cudaGetDeviceCount
                    0.00%  13.216us         2  6.6080us  6.5920us  6.6240us  cudaHostGetDevicePointer
                    0.00%  11.904us         2  5.9520us  5.0240us  6.8800us  cudaSetDevice
                    0.00%  10.848us         9  1.2050us     576ns  2.3360us  cuDeviceGetCount
                    0.00%  10.304us         3  3.4340us  2.5600us  4.1920us  cuInit
                    0.00%  6.6880us         4  1.6720us  1.2800us  2.1120us  cuDeviceGetName
                    0.00%  6.6560us         4  1.6640us  1.2160us  2.3680us  cuDriverGetVersion
                    0.00%  6.3680us         5  1.2730us     640ns  2.3040us  cuDeviceGet
                    0.00%  5.9520us         2  2.9760us  2.6560us  3.2960us  cudaDeviceGetStreamPriorityRange
                    0.00%  4.4160us         1  4.4160us  4.4160us  4.4160us  cuDeviceGetPCIBusId
                    0.00%  2.6560us         1  2.6560us  2.6560us  2.6560us  cuDeviceComputeCapability
                    0.00%  1.1520us         1  1.1520us  1.1520us  1.1520us  cuDevicePrimaryCtxGetState
                    0.00%     448ns         1     448ns     448ns     448ns  cuCtxGetCurrent
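
For anyone reading the "API calls" summary above, here is a minimal Python sketch that converts nvprof's mixed time units to milliseconds and checks how much of a row's total comes from its single slowest call. The helper names (`to_ms`, `parse_row`) are hypothetical, and the column order (percent, total, calls, avg, min, max, name) is assumed from the dump itself. The three rows are copied from the output above.

```python
import re

# Conversion factors from nvprof's unit suffixes to milliseconds.
_UNIT_TO_MS = {"s": 1000.0, "ms": 1.0, "us": 1e-3, "ns": 1e-6}


def to_ms(text):
    """Convert a time string such as '345.50ms' or '448ns' to milliseconds."""
    match = re.fullmatch(r"([0-9.]+)(s|ms|us|ns)", text)
    if match is None:
        raise ValueError("unrecognized time value: %r" % text)
    return float(match.group(1)) * _UNIT_TO_MS[match.group(2)]


def parse_row(line):
    """Parse one summary row into a dict (hypothetical helper)."""
    fields = line.split()
    # The first row of the section carries an 'API calls:' label.
    if fields[:2] == ["API", "calls:"]:
        fields = fields[2:]
    percent, total, calls, avg, low, high = fields[:6]
    return {
        "percent": float(percent.rstrip("%")),
        "total_ms": to_ms(total),
        "calls": int(calls),
        "avg_ms": to_ms(avg),
        "min_ms": to_ms(low),
        "max_ms": to_ms(high),
        "name": " ".join(fields[6:]),
    }


rows = [
    "API calls:   35.55%  3.88662s        19  204.56ms  1.4720us  2.78165s  cudaFree",
    "35.48%  3.87971s        16  242.48ms  3.3920us  3.87934s  cudaStreamCreateWithFlags",
    "16.83%  1.84008s     63535  28.961us  22.048us  735.01us  cudaLaunch",
]

for row in map(parse_row, rows):
    share = row["max_ms"] / row["total_ms"]
    print("{name}: total {total_ms:.1f} ms, slowest call is {share:.1%} of that".format(
        share=share, **row))
```

For cudaFree and cudaStreamCreateWithFlags, nearly all of the time sits in one call each, which looks like one-time initialization rather than a per-frame cost, though that is only my reading of the numbers.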

I wrote a blog post about my experience using the NVIDIA-Jetson/tf_trt_models code. I also shared a script that does real-time object detection with various cameras or file inputs. Feel free to check it out, and do let me know if you have suggestions about the code. I’ll update my blog post and my GitHub repo as needed.

https://jkjung-avt.github.io/tf-trt-models/
https://github.com/jkjung-avt/tf_trt_models

Awesome! Thanks for sharing this jkjung13.

As for the performance discrepancy / low GPU utilization, this may have to do with how the object detection post-processing pipeline is configured.

It seems that the default box score threshold for the non-maximum suppression stage is 1e-8, which essentially considers any box a detection. This may result in unnecessary box-to-box comparisons and a heavier CPU load. This parameter may be found here:

https://github.com/tensorflow/models/blob/17fa52864bfc7a7444a8b921d8a8eb1669e14ebd/research/object_detection/samples/configs/ssd_mobilenet_v1_coco.config#L130

I believe the benchmarks in tf_trt_models were collected using a threshold of 0.3. Are your models using a very low threshold? If so, could you try raising it to something larger (say, above 0.1) and report the performance?
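
As a rough illustration of raising the threshold, the Python sketch below rewrites the `score_threshold` field in the text of a pipeline config. The `set_score_threshold` helper is hypothetical, and the sample fragment is only meant to resemble the post-processing block of the linked config. My assumption is that the edited config has to be in place before the detection graph is exported, since the threshold ends up inside the graph.

```python
import re


def set_score_threshold(config_text, new_threshold):
    """Return config_text with each `score_threshold: <number>` replaced.

    Hypothetical helper: it edits the text only and does not validate
    the rest of the config.
    """
    pattern = re.compile(r"(score_threshold:\s*)[-+0-9.eE]+")
    replacement = repr(float(new_threshold))
    new_text, count = pattern.subn(
        lambda m: m.group(1) + replacement, config_text)
    if count == 0:
        raise ValueError("no score_threshold field found")
    return new_text


# Illustrative fragment modelled on the post_processing block of an SSD config.
sample = """
post_processing {
  batch_non_max_suppression {
    score_threshold: 1e-8
    iou_threshold: 0.6
    max_detections_per_class: 100
    max_total_detections: 100
  }
  score_converter: SIGMOID
}
"""

print(set_score_threshold(sample, 0.3))
```

In practice you would read the real `.config` file, pass its contents through the helper, and write the result back before building the frozen graph.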

Thanks!

Wow, that matters a lot! With a threshold of 0.3, I get a running time of 41.3ms using TensorFlow 1.10 and TensorRT 4 for the ssd_inception_v2 model, which is a lot faster than your reported time (maybe because I use a different image, so the NMS has even fewer boxes to compare?). Anyway, thanks, I consider this solved :)

With the official TensorFlow 1.9 I get 113ms now; I don’t really know what’s wrong, but it seems the graph optimization doesn’t work at all now. It doesn’t really matter, it’s probably just some conflicting versions of TensorRT and TensorFlow on my side…
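
To see why the threshold (and the image content) changes the NMS cost, here is a small pure-Python sketch of greedy single-class NMS that counts IoU evaluations. It is not the TensorFlow implementation, which as far as I know runs NMS per class. The boxes and scores are random placeholders, and `nms_with_count` is a hypothetical name.

```python
import random


def iou(a, b):
    """Intersection over union of two (ymin, xmin, ymax, xmax) boxes."""
    ih = min(a[2], b[2]) - max(a[0], b[0])
    iw = min(a[3], b[3]) - max(a[1], b[1])
    if ih <= 0 or iw <= 0:
        return 0.0
    inter = ih * iw
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms_with_count(boxes, scores, score_threshold, iou_threshold=0.6):
    """Greedy NMS; returns (kept indices, IoU evaluations, candidate count)."""
    # Only boxes at or above the score threshold enter NMS at all.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i], reverse=True)
    kept, comparisons = [], 0
    for i in order:
        suppressed = False
        for j in kept:
            comparisons += 1
            if iou(boxes[i], boxes[j]) > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            kept.append(i)
    return kept, comparisons, len(order)


# Placeholder data: 500 random boxes, most of them with low scores.
random.seed(0)
boxes, scores = [], []
for _ in range(500):
    y, x = random.uniform(0.0, 0.8), random.uniform(0.0, 0.8)
    h, w = random.uniform(0.05, 0.2), random.uniform(0.05, 0.2)
    boxes.append((y, x, y + h, x + w))
    scores.append(random.random() ** 4)

for threshold in (1e-8, 0.3):
    kept, comparisons, candidates = nms_with_count(boxes, scores, threshold)
    print("threshold={}: {} candidates, {} IoU evaluations, {} kept".format(
        threshold, candidates, comparisons, len(kept)))
```

With the higher threshold, fewer boxes enter NMS, so the number of box-to-box comparisons can only go down; an image with fewer confident detections has the same effect.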

@frederiki3k63, are you using JetPack-3.3 (with TensorRT 4.0 GA) on Jetson TX2? And which tensorflow-1.10 wheel did you use?

I’m stuck with the error: “E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cudnnFusedConvActLayer.cpp (64) - Cuda Error in createFilterTextureFused: 11” when I test with JetPack-3.3 and the “TF-1.10.1 for JetPack3.3” wheel (https://nvidia.app.box.com/v/TF1101-Py35-wTRT) linked from here: https://devtalk.nvidia.com/default/topic/1031300/jetson-tx2/tensorflow-1-8-wheel-with-jetpack-3-2-/

I used that wheel too. I use JetPack 3.2.1 now, but according to the JetPack website, JetPack 3.2.1 is the same as JetPack 3.3 apart from newer CUDA and cuDNN versions, and I installed the JetPack 3.3 versions of those (tensorrt_4.0.2.0-1+cuda9.0_arm64.deb and libcudnn7-dev_7.1.5.14-1+cuda9.0_arm64.deb, which is TensorRT 4.0 GA, I think).

Weird that it doesn’t work for you. Installing the TensorFlow 1.10 wheel was a bit of a hassle for me: I can’t seem to compile h5py, which is a dependency of Keras, which in turn is a dependency of TensorFlow, so I skipped it using pip --no-deps. And as reported earlier (and as you report on your blog), parsing a network from string takes ~10 minutes with this version.

Thanks for your blog, by the way. I enjoy your clearly written articles; they have helped me a lot in the past!

@frederiki3k63, thanks for your kind words.

Today I fell back to JetPack-3.2.1 (TensorRT 3.0 GA) and tested my scripts against the tensorflow 1.8.0 wheel (https://nvidia.app.box.com/v/TF180-Py35-wTRT) as specified in https://github.com/NVIDIA-Jetson/tf_trt_models/blob/master/README.md. And it indeed worked better! After setting score_threshold to 0.3, I was able to get ssd_mobilenet_v1_coco to do real-time object detection at ~20fps, just as advertised by NVIDIA. In addition, the TensorRT optimization process ran much faster (it only took 1~2 minutes) under this configuration.
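
For comparing numbers between setups, a minimal timing sketch in Python may help: it skips a few warm-up calls and then reports average latency and frames per second (20 fps corresponds to 50 ms per frame). The `benchmark` helper and the `time.sleep` workload are placeholders; in a real test the callable would wrap the session's inference call on a fixed input image.

```python
import time


def benchmark(fn, warmup=1, runs=200):
    """Return (average latency in ms, frames per second) for calling fn()."""
    # Warm-up calls absorb one-time costs such as CUDA initialization.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    avg_ms = 1000.0 * elapsed / runs
    fps = runs / elapsed if elapsed > 0 else float("inf")
    return avg_ms, fps


def fake_inference():
    # Placeholder workload standing in for a real inference call.
    time.sleep(0.005)


avg_ms, fps = benchmark(fake_inference, warmup=2, runs=20)
print("average latency: {:.1f} ms ({:.1f} fps)".format(avg_ms, fps))
```

Keeping the warm-up out of the measurement matters here, because the first call on Jetson tends to be much slower than the steady state.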

I’m going to experiment more and try finding a way to make it work equally well on JetPack-3.3.

Otherwise, it’d be ideal if NVIDIA people could re-build the tensorflow wheels and verify tf_trt_models code against JetPack-3.3.

I confirmed that the slowness on Jetson TX2 (loading SSD models, optimizing the model with TensorRT, loading the optimized graph, etc.) has a lot to do with the version of TensorFlow. My guess is that some recent changes in TensorFlow do not work that well on the aarch64 architecture.

Based on my testing, TF-TRT works great with TensorFlow 1.8.0. I’ve tested it on both JetPack-3.2.1 and JetPack-3.3.

For more details, please read my blog post and the README.md in my GitHub repo.

https://jkjung-avt.github.io/tf-trt-models/
https://github.com/jkjung-avt/tf_trt_models