Converting a frozen graph to TensorRT

Hi,
I am trying to convert an ssd_mobilenet_v2_coco model to a TensorRT model on a Jetson Nano. I trained the model with just one class on my laptop, which has this specification:
CPU: Intel i7-8750H @ 2.2GHz x12, 8GB RAM
GPU: NVIDIA Quadro P600, 4GB
I can run inference on my laptop at around 13Hz, but it takes around 70-80 seconds to run on the Jetson Nano. The Nano has JetPack 4.2 on it along with tensorflow 1.14.0+nv19.10, installed as per the NVIDIA guidelines. Strangely, if I use the CPU only (by setting os.environ["CUDA_VISIBLE_DEVICES"]='-1'), inference time is around 3 seconds.
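When comparing figures like these, it helps to time the first call separately from later calls, since the first call to a TensorFlow session often includes one-off setup work. A minimal sketch, assuming the inference step can be wrapped in a zero-argument callable (the name `measure_rate` and its parameters are hypothetical, not part of any library):

```python
import time


def measure_rate(run_once, warmup_runs=2, timed_runs=10):
    """Return (first_call_seconds, steady_state_hz) for a zero-argument callable."""
    # Time the very first call on its own: it may include one-off setup costs.
    start = time.perf_counter()
    run_once()
    first_call = time.perf_counter() - start

    # A few more untimed calls so caches and allocators settle.
    for _ in range(warmup_runs):
        run_once()

    # Time the steady state over several calls and report calls per second.
    start = time.perf_counter()
    for _ in range(timed_runs):
        run_once()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return first_call, timed_runs / elapsed
```

In the script from this post, the callable would be something like `lambda: session.run(fetches, feed_dict=feed)`, where `fetches` and `feed` stand for whatever the script already passes to `session.run`.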

I am now trying to convert the frozen graph to a TF-TRT model using TrtGraphConverter as shown at https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html. However, if I set the session config option allow_soft_placement=True, the speed is very bad again, and with allow_soft_placement=False, the code stops with the error:
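For reference, a minimal sketch of the conversion step being described, assuming TensorFlow 1.14 with TF-TRT available. The function name `convert_frozen_graph` and the output node names in the usage note are assumptions, not taken from the post. TensorFlow is imported inside the function so the file can be loaded on a machine without it.

```python
SUPPORTED_PRECISIONS = ("FP32", "FP16", "INT8")


def convert_frozen_graph(frozen_graph_path, output_names, precision_mode="FP16"):
    """Load a frozen GraphDef and return a TF-TRT optimized GraphDef."""
    if precision_mode not in SUPPORTED_PRECISIONS:
        raise ValueError("precision_mode must be one of %s" % (SUPPORTED_PRECISIONS,))

    # Lazy imports: these require a TensorFlow 1.14 build with TensorRT support.
    import tensorflow as tf
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    # Read the serialized frozen graph from disk.
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(frozen_graph_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    # nodes_blacklist lists the output nodes that must be kept in the graph.
    converter = trt.TrtGraphConverter(
        input_graph_def=graph_def,
        nodes_blacklist=output_names,
        precision_mode=precision_mode,
    )
    return converter.convert()
```

A call might look like `convert_frozen_graph("frozen_inference_graph.pb", ["detection_boxes", "detection_scores", "detection_classes", "num_detections"])`. Both the file name and the node names are the usual Object Detection API defaults and should be checked against the actual graph.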

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation Preprocessor/map/TensorArray_2: Could not satisfy explicit device specification ‘/device:GPU:0’ because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_=’/device:GPU:0’ assigned_device_name_=’’ resource_device_name_=’’ supported_device_types_=[CPU, XLA_CPU, XLA_GPU] possible_devices_=
TensorArrayGatherV3: GPU CPU XLA_CPU XLA_GPU
Enter: GPU CPU XLA_CPU XLA_GPU
TensorArrayV3: CPU XLA_CPU XLA_GPU
TensorArrayWriteV3: CPU XLA_CPU XLA_GPU
TensorArraySizeV3: GPU CPU XLA_CPU XLA_GPU
Const: GPU CPU XLA_CPU XLA_GPU
Range: GPU CPU XLA_CPU XLA_GPU

Colocation members, user-requested devices, and framework assigned devices, if any:
Preprocessor/map/TensorArray_2 (TensorArrayV3) /device:GPU:0
Preprocessor/map/while/ResizeImage/stack_1 (Const) /device:GPU:0
Preprocessor/map/while/TensorArrayWrite_1/TensorArrayWriteV3/Enter (Enter) /device:GPU:0
Preprocessor/map/while/TensorArrayWrite_1/TensorArrayWriteV3 (TensorArrayWriteV3) /device:GPU:0
Preprocessor/map/TensorArrayStack_1/TensorArraySizeV3 (TensorArraySizeV3) /device:GPU:0
Preprocessor/map/TensorArrayStack_1/range/start (Const) /device:GPU:0
Preprocessor/map/TensorArrayStack_1/range/delta (Const) /device:GPU:0
Preprocessor/map/TensorArrayStack_1/range (Range) /device:GPU:0
Preprocessor/map/TensorArrayStack_1/TensorArrayGatherV3 (TensorArrayGatherV3) /device:GPU:0

Op: TensorArrayV3
Node attrs: element_shape=, dynamic_size=false, clear_after_read=true, identical_element_shapes=true, tensor_array_name="", dtype=DT_INT32
Registered kernels:
device=‘XLA_GPU’; dtype in [DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT8, …, DT_QINT32, DT_BFLOAT16, DT_HALF, DT_UINT32, DT_UINT64]
device=‘XLA_CPU’; dtype in [DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT8, …, DT_BFLOAT16, DT_COMPLEX128, DT_HALF, DT_UINT32, DT_UINT64]
device=‘XLA_CPU_JIT’; dtype in [DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT8, …, DT_BFLOAT16, DT_COMPLEX128, DT_HALF, DT_UINT32, DT_UINT64]
device=‘XLA_GPU_JIT’; dtype in [DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT8, …, DT_QINT32, DT_BFLOAT16, DT_HALF, DT_UINT32, DT_UINT64]
device=‘GPU’; dtype in [DT_BFLOAT16]
device=‘GPU’; dtype in [DT_INT64]
device=‘GPU’; dtype in [DT_COMPLEX128]
device=‘GPU’; dtype in [DT_COMPLEX64]
device=‘GPU’; dtype in [DT_DOUBLE]
device=‘GPU’; dtype in [DT_FLOAT]
device=‘GPU’; dtype in [DT_HALF]
device=‘CPU’

 [[{{node Preprocessor/map/TensorArray_2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "fulldetect_test_with_trt_R2.py", line 231, in <module>
    solve.detect_quad()
  File "fulldetect_test_with_trt_R2.py", line 214, in detect_quad
    self.process_image_and_plot(img, category_index)
  File "fulldetect_test_with_trt_R2.py", line 124, in process_image_and_plot
    (boxes, scores, classes, num_detections) = self.session.run([boxes, scores, classes, num_detections], feed_dict={image_tensor: image_np_expanded})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)

I am probably making an error when converting the model. I have also tried create_inference_graph instead of TrtGraphConverter, with the same outcome. I have noticed that with create_inference_graph, the optimized graph is bigger than the original trained frozen graph.
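The error above says the TensorArray ops were explicitly requested on /device:GPU:0, and the kernel list shows no plain GPU kernel for TensorArrayV3 with DT_INT32. One possible workaround, sketched here as an assumption rather than a confirmed fix, is to clear explicit device pins in the GraphDef before importing it and optionally pin selected nodes to the CPU. The helper name `relax_device_pins` and the name fragment used below are hypothetical. The demo uses stand-in objects so it runs without TensorFlow; a real `GraphDef` exposes the same `node`, `name` and `device` fields.

```python
from types import SimpleNamespace


def relax_device_pins(graph_def, force_cpu_fragments=()):
    """Clear explicit device requests; pin nodes whose name matches a fragment to CPU.

    Returns the number of nodes whose device field was changed.
    """
    changed = 0
    for node in graph_def.node:
        on_cpu = any(fragment in node.name for fragment in force_cpu_fragments)
        target = "/device:CPU:0" if on_cpu else ""  # "" lets TensorFlow choose
        if node.device != target:
            node.device = target
            changed += 1
    return changed


# Stand-in graph with two nodes (node names borrowed from the error log).
demo_graph = SimpleNamespace(node=[
    SimpleNamespace(name="Preprocessor/map/TensorArray_2", device="/device:GPU:0"),
    SimpleNamespace(name="FeatureExtractor/conv", device=""),
])
demo_changed = relax_device_pins(demo_graph, force_cpu_fragments=("Preprocessor/",))
print(demo_changed, [n.device for n in demo_graph.node])
```

With the pins cleared, allow_soft_placement is no longer needed to get past placement, although whether this also restores the speed would have to be measured.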

Can you please let me know how I can overcome these issues, or point me towards the right places to look? I have tried a few solutions that I found online, but none seem to work. Do I need to do anything while training, or while converting the trained model to a frozen graph, in order to use the graph on the Jetson Nano?

Let me know if any other info is needed.

Thanks.

Hi,
The problem was definitely that I was not getting the TRT model right. I used my training file and the detection.py script from https://github.com/NVIDIA-AI-IOT/tf_trt_models and can now get inference at about 6Hz, although I haven't yet tested with a live camera stream. It still takes a long time to load the model and start the TF session, so the time to first inference is 2-3 minutes. Is it possible to speed it up further? How can I check the version of protobuf ("protoc --version" doesn't work) and ensure that I have the correct version? When I run python -c "from google.protobuf.internal import api_implementation; print(api_implementation._default_implementation_type)", the Jetson returns python while my laptop returns cpp. I tried running export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp, but that doesn't change anything.
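The protobuf check can also be done from Python without protoc. A minimal sketch (the function name `protobuf_info` is hypothetical) that reports the installed package version, the backend actually in use, and the value of the environment variable. Note that the variable is only read when protobuf is first imported, and requesting cpp can only work if the installed package was built with the C++ extension.

```python
import os


def protobuf_info():
    """Return protobuf version and active backend, or None if it is not installed."""
    try:
        import google.protobuf
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    return {
        "version": google.protobuf.__version__,
        # Backend in use for this process, e.g. "python" or "cpp".
        "implementation": api_implementation.Type(),
        # What was requested through the environment, if anything.
        "requested": os.environ.get("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"),
    }


print(protobuf_info())
```

If "requested" shows cpp but "implementation" still shows python, the likely explanation is that the installed package does not include the C++ backend.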

I trained the network with an image size of 480x640. Would changing it to 300x300 and resizing the image at inference time improve performance? Is there a setting I can adjust in gpu_options or when creating the inference graph? I am using "FP16" precision for now in create_inference_graph. I am experimenting with these, but would really appreciate any help.
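As a rough guide to the input size question, the pixel counts can be compared directly. This sketch assumes that the cost of the convolutional layers scales roughly with the number of input pixels, which is only an approximation; the real speedup depends on the rest of the pipeline and has to be measured.

```python
def pixel_ratio(old_hw, new_hw):
    """Return how many times more pixels the old (height, width) has than the new one."""
    old_h, old_w = old_hw
    new_h, new_w = new_hw
    return (old_h * old_w) / (new_h * new_w)


ratio = pixel_ratio((480, 640), (300, 300))
print("%.2fx fewer pixels at 300x300" % ratio)  # 307200 / 90000 -> 3.41x
```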

Thanks

Moving this to the Jetson Nano forum so the Jetson team can take a look.

Hi,

Sorry for the late update.

1. Have you maximized the Nano performance?

sudo nvpmodel -m 0
sudo jetson_clocks

This switches the Nano into its maximum performance mode.

2. It's recommended to convert the model into pure TensorRT rather than TF-TRT.
TensorFlow runs really slowly on the Jetson Nano and occupies almost all of the memory.
Please try following this tutorial to convert your model into pure TensorRT:
https://github.com/AastaNV/TRT_object_detection

ssd_mobilenet_v2_coco takes around 46ms per inference with the old JetPack 4.2.x software.

Thanks.

Hi, thanks for the reply. I had already maximized the Nano performance. I got busy with something else, so I did not work on this further, but I was able to reduce the image size to 300x300 and get inference at about 19Hz with the TF-TRT model. I will pick this up again at some point and then try converting my model to a pure TensorRT model for further improvement.

Thanks again,
Indrajeet
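The thread mixes rates in Hz with latencies in ms, so a small conversion makes the figures comparable. The numbers below are the ones quoted in the posts; the 46ms figure is the one given for the stock ssd_mobilenet_v2_coco model, not a measurement of the retrained one.

```python
def hz_to_ms(hz):
    """Convert a rate in inferences per second to milliseconds per inference."""
    return 1000.0 / hz


def ms_to_hz(ms):
    """Convert milliseconds per inference to inferences per second."""
    return 1000.0 / ms


for label, hz in [("laptop", 13), ("TF-TRT 480x640", 6), ("TF-TRT 300x300", 19)]:
    print("%-16s %5.1f Hz = %6.1f ms" % (label, hz, hz_to_ms(hz)))

print("%-16s %5.1f Hz = %6.1f ms" % ("pure TensorRT", ms_to_hz(46), 46))
```

On these figures, 19Hz is about 53ms per frame, so the quoted 46ms for pure TensorRT (about 22Hz) would be a modest further gain if it carried over to this model.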