Nvidia Jetson Xavier NX extremely slow even with TensorRT inference for YOLOv3

Hello

I’m using an Nvidia Jetson Xavier NX for object detection, with YOLOv3 as the detection algorithm. What I’m noticing is that without TensorRT the device freezes while doing detection, and I have to abort the program. With TensorRT the performance is better, but inference still has a huge delay: a frame captured 10 minutes ago is still not processed, and it will take a lot longer before it is. The backlog keeps growing over time. I’m using YOLOv3-608 and a webcam connected over IP. I should also mention that while doing detections I’m writing each frame to a PostgreSQL database. I don’t think this is normal, given that the power mode is at maximum and I already maximized the clocks. Is there any trick to improve the performance? There is also a red exclamation point next to the power mode indicator.

Thank you in advance

Environment

TensorRT Version : 8.0
GPU Type : Nvidia Jetson Xavier NX
Nvidia Driver Version :
CUDA Version : 10.2
CUDNN Version : 8.2.1
Operating System + Version : Ubuntu 18.04

So, I tried running the code on a second Nvidia board and the issue persists. What could be causing this?

Which type of power supply are you using?

Hi,

Do you use the YOLOv3 example located in the TensorRT folder?
If yes, would you mind switching to our Deepstream example instead?

/opt/nvidia/deepstream/deepstream-5.1/sources/objectDetector_Yolo

The TensorRT sample is intended to demonstrate model conversion.
It doesn’t optimize the camera/display pipeline, which might cause latency.

In the Deepstream sample, we can reach 1080p @ 30fps with YOLOv3 on Xavier NX.
This indicates the per-frame latency should be within about 33ms.
(Please use JetPack 4.5.1, since we don’t have Deepstream support for v4.6 yet.)
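
The sample can usually be built and run as below (please check the README in that folder for the exact steps, and set CUDA_VER to match your JetPack):

$ cd /opt/nvidia/deepstream/deepstream-5.1/sources/objectDetector_Yolo
$ ./prebuild.sh                        # downloads the YOLO cfg/weights files
$ export CUDA_VER=10.2                 # match the CUDA version of your JetPack
$ make -C nvdsinfer_custom_impl_Yolo   # builds the custom YOLO parser library
$ deepstream-app -c deepstream_app_config_yoloV3.txt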

Thanks.

Hi,

The one that came with the board

Hi,

I’m not using the example, I’m using a custom app. Anyway, I know that YOLOv3 runs fine on Nvidia hardware, and with TensorRT it should be even better. But as things stand, the device struggles to process the frames, although it is better than before I used TensorRT. Is there anything that needs to be enabled on the board? It feels like I’m using a power-save mode or something. I’m also noticing that a trail appears when moving the cursor, as if the refresh rate were very low. I should mention that the kit is the American version and I’m in Portugal. Is it possible that the board isn’t getting enough power?

Can someone help? This is a company project and needs a solution ASAP.

Thanks

First, I would follow @AastaLLL’s suggestion and run the examples to see whether you can reach the reference performance values on the same device.

How much memory is used/free when running your application?
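
On Jetson you can watch memory, CPU and GPU usage while the application is running, for example with:

$ sudo tegrastats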

Right now I can’t test it, but I’ll try later.
I also tried the darknet example before (the one on GitHub), and its performance with a webcam is horrendous. I’m using the RTSP protocol for streaming, by the way. The thing is, the custom app works fine on a regular computer.

Thank you for the help

Hi,

It’s recommended to use Deepstream for the RTSP source.
We have optimized the multimedia pipeline and the inference performance on Jetson.

For the power mode issue, have you tried the commands below to maximize the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
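
You can verify which mode is actually active with:

$ sudo nvpmodel -q

(The available mode IDs are listed in /etc/nvpmodel.conf; on Xavier NX the 15W 6-core profile is typically a different index than 0, e.g. sudo nvpmodel -m 2.)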

Thanks.

I tried those commands, but I only see the power mode change to 15W and 2 cores. I’ll try with Deepstream.

Thanks

Hello guys

So, the device is still extremely slow. I haven’t tried Deepstream yet, but I don’t think that is the problem. I noticed that the inference between frames takes a lot of time, while the info printed after inference appears quickly. So basically the frame is read, it takes 2-3 seconds to infer, it prints the info, and it reads the next frame in less than 1 second, so the inference is the problem. I’m also noticing “System throttled due to over-current” at the top right of the screen.

Hello

I still need help. Why is the inference so slow, even with TensorRT? Could it be a lack of power?

Regards

Take a look at my code, and see if there is something weird

def processa(self):

    """Create a TensorRT engine for ONNX-based YOLOv3-608 and run inference."""

    # Try to load a previously generated YOLOv3-608 network graph in ONNX format:
    onnx_file_path = "yolov3.onnx"
    engine_file_path = "yolov3.trt"
    # Create a dedicated CUDA context for this thread:
    dev = cuda.Device(0)  # 0 is your GPU number
    ctx = dev.make_context()
    with get_engine(
        onnx_file_path, engine_file_path
    ) as engine, engine.create_execution_context() as context:
        inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        while True:
            ativo, frame = self.imagem.read()
            if not ativo:
                # Capture failed (e.g. the RTSP stream dropped a frame); skip and try the next read
                continue
            # Write the captured frame to disk so the sample preprocessor can load it by path
            cv2.imwrite("frame.jpeg", frame)
            input_image_path = "frame.jpeg"
            # Two-dimensional tuple with the target network's (spatial) input resolution in HW order
            input_resolution_yolov3_HW = (608, 608)
            # Create a pre-processor object by specifying the required input resolution for YOLOv3
            preprocessor = PreprocessYOLO(input_resolution_yolov3_HW)
            # Load an image from the specified input path, and return it together with  a pre-processed version
            image_raw, image = preprocessor.process(input_image_path)
            # Store the shape of the original input image in WH format, we will need it for later
            shape_orig_WH = image_raw.size

            # Output shapes expected by the post-processor
            output_shapes = [(1, 255, 19, 19), (1, 255, 38, 38), (1, 255, 76, 76)]
            # Do inference with TensorRT
            trt_outputs = []

            # Do inference
            print("Running inference on image {}...".format(input_image_path))
            # Set host input to the image. The common.do_inference function will copy the input to the GPU before executing.
            inputs[0].host = image
            trt_outputs = common.do_inference_v2(
                context,
                bindings=bindings,
                inputs=inputs,
                outputs=outputs,
                stream=stream,
            )

            # Before doing post-processing, we need to reshape the outputs as common.do_inference_v2 gives us flat arrays.
            trt_outputs = [
                output.reshape(shape)
                for output, shape in zip(trt_outputs, output_shapes)
            ]

            postprocessor_args = {
                "yolo_masks": [
                    (6, 7, 8),
                    (3, 4, 5),
                    (0, 1, 2),
                ],  # A list of 3 three-dimensional tuples for the YOLO masks
                "yolo_anchors": [
                    (10, 13),
                    (16, 30),
                    (33, 23),
                    (30, 61),
                    (
                        62,
                        45,
                    ),  # A list of 9 two-dimensional tuples for the YOLO anchors
                    (59, 119),
                    (116, 90),
                    (156, 198),
                    (373, 326),
                ],
                "obj_threshold": 0.6,  # Threshold for object coverage, float value between 0 and 1
                "nms_threshold": 0.5,  # Threshold for non-max suppression algorithm, float value between 0 and 1
                "yolo_input_resolution": input_resolution_yolov3_HW,
            }

            postprocessor = PostprocessYOLO(**postprocessor_args)

            # Run the post-processing algorithms on the TensorRT outputs and get the bounding box details of detected objects
            boxes, classes, scores = postprocessor.process(
                trt_outputs, shape_orig_WH
            )
            objetos_captuados_frame = []
            if boxes is not None:
                for i in range(len(boxes)):
                    objeto_no_frame = {}
                    x = round(boxes[i][0])
                    y = round(boxes[i][1])
                    w = round(boxes[i][2])
                    h = round(boxes[i][3])
                    objeto_no_frame["object_id"] = classes[i]
                    objeto_no_frame["confianca"] = scores[i]
                    objeto_no_frame["topLeft"] = [x, y]
                    objeto_no_frame["bottomRight"] = [w, h]
                    objetos_captuados_frame.append(objeto_no_frame)
                # Draw the bounding boxes onto the original input image and save it as a PNG file
                image_raw = draw_bboxes(
                    image_raw, boxes, scores, classes, ALL_CATEGORIES
                )
                output_image_path = "dog_bboxes.png"
                image_raw.save(output_image_path, "PNG")
                # _, buffer = cv2.imencode(".png", obj_detected_img)
                print(
                    "Saved image with bounding boxes of detected objects to {}.".format(
                        output_image_path
                    )
                )
            numpy_image = np.array(image_raw)

            # convert to a openCV2 image, notice the COLOR_RGB2BGR which means that
            # the color is converted from RGB to BGR format
            image = cv2.cvtColor(numpy_image, cv2.COLOR_RGB2BGR)
            self.framecurrente = (
                image  # obj_detected_img = obj_detected_img.tobytes()
            )

            videredb.guardaFrame(
                # obj_detected_img,
                image_raw,
                self.id_user,
                time.time(),
                objetos_captuados_frame,
            )
    ctx.pop()
    del ctx
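
To check where the time actually goes inside this loop, each stage can be timed; a minimal sketch using time.perf_counter() and the same variable names as in the code above:

import time

t0 = time.perf_counter()
image_raw, image = preprocessor.process(input_image_path)
t1 = time.perf_counter()

inputs[0].host = image
trt_outputs = common.do_inference_v2(
    context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream
)
trt_outputs = [o.reshape(s) for o, s in zip(trt_outputs, output_shapes)]
t2 = time.perf_counter()

boxes, classes, scores = postprocessor.process(trt_outputs, shape_orig_WH)
t3 = time.perf_counter()

print(
    "preprocess {:.3f}s, inference {:.3f}s, postprocess {:.3f}s".format(
        t1 - t0, t2 - t1, t3 - t2
    )
)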

Guys? Apparently what is making the system so slow is the post-processing. How can I solve this?

Thanks

Hi,

Would you mind giving Deepstream a try?

The Python sample uses OpenCV-related libraries for post-processing,
and it’s expected to be slow given the limited CPU power on Xavier NX.

For Deepstream, we have optimized the pipeline and offload the pre-/post-processing to different hardware engines.
This will give you much better performance than doing the post-processing on the CPU alone.

Thanks.

Ok, I didn’t mention it, but I already tried Deepstream. It is way faster, but there is still some delay. I mean it keeps processing old frames (for example, after 5 minutes of the app running, the frames captured at the 2-minute mark are still being processed).
I basically need to reset the frame capture in Deepstream so that it processes the current frame, not old ones. How can I do that?

Thank you for your response
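
One possible workaround for the original OpenCV capture path (this is not Deepstream-specific) is a grabber thread that keeps only the newest decoded frame, so each inference runs on the most recent image instead of a growing backlog; a minimal sketch with illustrative names:

import threading

import cv2


class LatestFrameReader:
    """Keep only the most recent frame from a video source (e.g. an RTSP URL)."""

    def __init__(self, source):
        self.cap = cv2.VideoCapture(source)
        self.lock = threading.Lock()
        self.frame = None
        self.running = True
        self.thread = threading.Thread(target=self._reader, daemon=True)
        self.thread.start()

    def _reader(self):
        # Continuously read; every new frame overwrites the previous one,
        # so stale frames are simply dropped instead of piling up.
        while self.running:
            ok, frame = self.cap.read()
            if not ok:
                continue
            with self.lock:
                self.frame = frame

    def read(self):
        with self.lock:
            return self.frame is not None, self.frame

    def stop(self):
        self.running = False
        self.thread.join()
        self.cap.release()

The detection loop would then call read() on this object instead of self.imagem.read(), so every inference works on the latest available frame rather than on a queued one.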

Hi

Still needing help

Thanks

Ok, it seems I was missing the calibration file. Now it is working very fast. I would like to know which function or line of code is responsible for displaying the frame with the detected objects. I’m referring to deepstream-test3, which shows the stream with the red rectangles around the objects.

Thanks
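
For reference, in the Deepstream Python samples such as deepstream-test3 the rectangles are drawn by the nvdsosd element from the inference metadata, and the frames are put on screen by the video sink at the end of the pipeline; a simplified sketch of that part of the pipeline, with illustrative variable names:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Elements at the display end of the pipeline, as in the standard samples:
nvvidconv = Gst.ElementFactory.make("nvvideoconvert", "convertor")
nvosd = Gst.ElementFactory.make("nvdsosd", "onscreendisplay")            # draws the boxes/labels from the metadata
transform = Gst.ElementFactory.make("nvegltransform", "egl-transform")   # needed on Jetson before the EGL sink
sink = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")      # actually renders the frames on screen
# These elements are added to the pipeline and linked after the tiler in the sample.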