Low FPS on Jetson Nano using TensorRT


I recently got my hands on a Jetson Nano and deployed a simple image classification model with only 3 classes, which I created in Keras. I followed this blog to convert it to TensorRT with FP32 precision.
I ran inference using a webcam; the model loaded in approx. 14 seconds and averaged 7.5 FPS while using 1.5 GB of RAM.
I also ran inference with the TensorFlow model. It loaded in approx. 4 minutes, and I had to increase the swapfile size to 6 GB to meet its memory demand after it exhausted the 4 GB of RAM, or else the process would get killed. This TF model averaged 17 FPS.
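A minimal sketch of how such an FPS average can be measured (the `measure_fps` helper and the `process_frame` callable are illustrative, not from the blog):

```python
import time

def measure_fps(process_frame, frames):
    """Average frames-per-second over a sequence of frames.

    process_frame: callable standing in for preprocessing + inference.
    frames: any iterable of frames (e.g. read from a webcam loop).
    """
    frames = list(frames)
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```
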

The question is: why is TensorRT not giving better FPS even though it is optimized? Am I missing something?



It’s recommended to test your model with trtexec first to see the optimal performance.
Assuming you have generated the ONNX model from the blog shared above, please try these commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
$ /usr/src/tensorrt/bin/trtexec --onnx=[your/model]
$ /usr/src/tensorrt/bin/trtexec --onnx=[your/model] --fp16



Here is the output

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=cnn3.onnx
    [08/24/2020-15:57:12] [I] === Model Options ===
    [08/24/2020-15:57:12] [I] Format: ONNX
    [08/24/2020-15:57:12] [I] Model: cnn3.onnx
    [08/24/2020-15:57:12] [I] Output:
    [08/24/2020-15:57:12] [I] === Build Options ===
    [08/24/2020-15:57:12] [I] Max batch: 1
    [08/24/2020-15:57:12] [I] Workspace: 16 MB
    [08/24/2020-15:57:12] [I] minTiming: 1
    [08/24/2020-15:57:12] [I] avgTiming: 8
    [08/24/2020-15:57:12] [I] Precision: FP32
    [08/24/2020-15:57:12] [I] Calibration: 
    [08/24/2020-15:57:12] [I] Safe mode: Disabled
    [08/24/2020-15:57:12] [I] Save engine: 
    [08/24/2020-15:57:12] [I] Load engine: 
    [08/24/2020-15:57:12] [I] Builder Cache: Enabled
    [08/24/2020-15:57:12] [I] NVTX verbosity: 0
    [08/24/2020-15:57:12] [I] Inputs format: fp32:CHW
    [08/24/2020-15:57:12] [I] Outputs format: fp32:CHW
    [08/24/2020-15:57:12] [I] Input build shapes: model
    [08/24/2020-15:57:12] [I] Input calibration shapes: model
    [08/24/2020-15:57:12] [I] === System Options ===
    [08/24/2020-15:57:12] [I] Device: 0
    [08/24/2020-15:57:12] [I] DLACore: 
    [08/24/2020-15:57:12] [I] Plugins:
    [08/24/2020-15:57:12] [I] === Inference Options ===
    [08/24/2020-15:57:12] [I] Batch: 1
    [08/24/2020-15:57:12] [I] Input inference shapes: model
    [08/24/2020-15:57:12] [I] Iterations: 10
    [08/24/2020-15:57:12] [I] Duration: 3s (+ 200ms warm up)
    [08/24/2020-15:57:12] [I] Sleep time: 0ms
    [08/24/2020-15:57:12] [I] Streams: 1
    [08/24/2020-15:57:12] [I] ExposeDMA: Disabled
    [08/24/2020-15:57:12] [I] Spin-wait: Disabled
    [08/24/2020-15:57:12] [I] Multithreading: Disabled
    [08/24/2020-15:57:12] [I] CUDA Graph: Disabled
    [08/24/2020-15:57:12] [I] Skip inference: Disabled
    [08/24/2020-15:57:12] [I] Inputs:
    [08/24/2020-15:57:12] [I] === Reporting Options ===
    [08/24/2020-15:57:12] [I] Verbose: Disabled
    [08/24/2020-15:57:12] [I] Averages: 10 inferences
    [08/24/2020-15:57:12] [I] Percentile: 99
    [08/24/2020-15:57:12] [I] Dump output: Disabled
    [08/24/2020-15:57:12] [I] Profile: Disabled
    [08/24/2020-15:57:12] [I] Export timing to JSON file: 
    [08/24/2020-15:57:12] [I] Export output to JSON file: 
    [08/24/2020-15:57:12] [I] Export profile to JSON file: 
    [08/24/2020-15:57:12] [I] 

Input filename:   cnn3.onnx
ONNX IR version:  0.0.4
Opset version:    8
Producer name:    tf2onnx
Producer version: 1.6.3
Model version:    0
Doc string:       

[08/24/2020-15:57:17] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/24/2020-15:57:19] [E] [TRT] Network has dynamic or shape inputs, but no optimization profile has been defined.
[08/24/2020-15:57:19] [E] [TRT] Network validation failed.
[08/24/2020-15:57:19] [E] Engine creation failed
[08/24/2020-15:57:19] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=cnn3.onnx



May I know how you run the TensorRT inference in the original post?
It looks like your model uses a dynamic shape, is that correct?
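For reference, a "dynamic shape" means one or more input dimensions were left unspecified when the model was exported (commonly the batch dimension), which is why trtexec reports "Network has dynamic or shape inputs, but no optimization profile has been defined." A toy sketch of the distinction (the `has_dynamic_dims` helper name is just for illustration):

```python
def has_dynamic_dims(shape):
    """True if any dimension of the input shape is unknown.

    Dynamic dimensions typically appear as None, -1, or a symbolic
    name (e.g. "N") in an exported ONNX model's input shape.
    """
    return any(d is None or d == -1 or isinstance(d, str) for d in shape)

# A fixed shape like [1, 228, 228, 3] builds without an optimization
# profile; [None, 228, 228, 3] (unknown batch) is dynamic and needs one.
```
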



Here is the inference code

# imports assumed from the blog's setup: cv2, tensorrt, and its eng/inf helper modules
import cv2
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

serialized_plan_fp32 = "model/cnn_model.plan"
# image size used in training 
img_size = 228
HEIGHT = img_size
WIDTH = img_size

engine = eng.load_engine(trt_runtime, serialized_plan_fp32)
h_input, d_input, h_output, d_output, stream = inf.allocate_buffers(engine, 1, trt.float32)

vs = cv2.VideoCapture(0) 

while True:
	_, frame = vs.read()
	image = process_img(frame)

	pred = inf.do_inference(engine, image, h_input, d_input, h_output, d_output, stream, 1, HEIGHT, WIDTH)
	cv2.imshow("Frame", frame)
	key = cv2.waitKey(1) & 0xFF
	if key == ord("q"):
		break

vs.release()
cv2.destroyAllWindows()

To be honest, I don’t know what a dynamic shape is. Deep learning is still relatively new to me; I’m currently pursuing my undergraduate degree.

Thanks for your time.


Could you share the source of do_inference?
If it is implemented in a library, would you mind sharing which module you import?


The do_inference code is the same as in the above-mentioned blog.

def do_inference(engine, pics_1, h_input_1, d_input_1, h_output, d_output, stream, batch_size, height, width):
    """This is the function to run the inference.
       engine : The TensorRT engine
       pics_1 : Input images to the model.
       h_input_1: Input in the host
       d_input_1: Input in the device
       h_output: Output in the host
       d_output: Output in the device
       stream: CUDA stream
       batch_size : Batch size for execution time
       height: Height of the output image
       width: Width of the output image
       Returns the list of output images.
    """
    load_images_to_buffer(pics_1, h_input_1)

    with engine.create_execution_context() as context:
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(d_input_1, h_input_1, stream)

        # Run inference.
        context.profiler = trt.Profiler()
        context.execute(batch_size=batch_size, bindings=[int(d_input_1), int(d_output)])

        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream so the async copy has finished.
        stream.synchronize()
        # Return the host output.
        out = h_output
        return out


Sorry for the missing information.

It looks like you already have an engine file model/cnn_model.plan.
So you can benchmark the inference directly with the following command:

/usr/src/tensorrt/bin/trtexec --loadEngine=cnn_model.plan
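trtexec reports per-inference latency; for a single-stream, batch-1 run, throughput in FPS is just the reciprocal of the mean latency. A trivial conversion (hypothetical helper name) for comparing against the webcam numbers above:

```python
def latency_ms_to_fps(mean_latency_ms):
    """Convert a mean per-inference latency in milliseconds to FPS
    (single stream, batch size 1)."""
    return 1000.0 / mean_latency_ms
```

Note that the end-to-end webcam FPS will be lower than this, since it also includes capture, preprocessing, and display time.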