Just adding a `cv2.imread` call makes the inference time increase by about 80%, even though the imread result is not used

ENVIRONMENT:
Jetson Nano 4GB
Docker image: nvcr.io/nvidia/l4t-ml:r32.7.1-py3

I will give the complete code below; here is my callback code:

def my_callback():
    # cv2.imread("./00041.jpg")
    astart = time.time()
    runner._infer(pad_img)
    aend = time.time()
    
    runner_time = aend - astart
    print("runner time is %.3f"%(runner_time))
    return 

With the cv2.imread line commented out as above, the runner time is 0.014; when the cv2.imread("./00041.jpg") line is enabled, the runner time becomes 0.026. It is really weird; I hope someone can help me figure it out.

Here is the complete code:

import time
import numpy as np
import cv2


import tensorrt as trt
import pycuda.autoinit
import pycuda.driver as cuda


# Stub that simply invokes the callback a fixed number of times
class Team:
    def __init__(self):
        pass

        
    def run(self, callback):
        for _ in range(400):
            callback()
       

team = Team()

class RUNNER(object):
    def __init__(self, engine, batch_size):
        #cuda.init()

        logger = trt.Logger(trt.Logger.WARNING)
        logger.min_severity = trt.Logger.Severity.ERROR
        trt.init_libnvinfer_plugins(logger,'')
        
        self.batch_size = batch_size
        self.context = engine.create_execution_context()
        self.imgsz = engine.get_binding_shape(0)[2:]
        self.inputs, self.outputs, self.bindings = [], [], []
        self.stream = cuda.Stream()

        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            if engine.binding_is_input(binding):
                self.inp_size = size 
                host_mem = cuda.pagelocked_empty(size, dtype)
                device_mem = cuda.mem_alloc(host_mem.nbytes)
                self.bindings.append(int(device_mem))
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                host_mem = cuda.pagelocked_empty(size, dtype)
                device_mem = cuda.mem_alloc(host_mem.nbytes)
                self.bindings.append(int(device_mem))
                self.outputs.append({'host': host_mem, 'device': device_mem})



    def _infer(self, img):
        
        infer_num = img.shape[0]
        # pad the input if the last batch is smaller than batch_size
        img_flatten = np.ravel(img)
        pad_zeros = np.zeros(self.inp_size - img_flatten.shape[0], dtype=np.float32)
        img_inp = np.concatenate([img_flatten, pad_zeros], axis=0)
        self.inputs[0]['host'] = img_inp

        for inp in self.inputs:
            cuda.memcpy_htod(inp['device'], inp['host'])

        # run inference
        self.context.execute_v2(
            bindings=self.bindings,
            )

        # fetch outputs from the GPU
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)

        # synchronize the stream so the host buffers are valid before reading them
        self.stream.synchronize()
        data = [out['host'] for out in self.outputs]
        return infer_num, data


def _get_engine(engine_path):
    logger = trt.Logger(trt.Logger.WARNING)
    logger.min_severity = trt.Logger.Severity.ERROR
    runtime = trt.Runtime(logger)
    trt.init_libnvinfer_plugins(logger,'') # initialize TensorRT plugins
    with open(engine_path, "rb") as f:
        serialized_engine = f.read()
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    return engine

batch_size = 1
engine_path = './test.batch1.fp16.trt'
engine = _get_engine(engine_path)
runner = RUNNER(engine, batch_size)

pad_img = np.load('./data_640x384_batch1_nonorm.npy')

def my_callback():
    cv2.imread("./00041.jpg")
    astart = time.time()
    runner._infer(pad_img)
    aend = time.time()
    
    runner_time = aend - astart
    print("runner time is %.3f"%(runner_time))
    return 

team.run(my_callback)

Hi,

The CPU runs code sequentially.
imread involves file reading and JPEG decoding, so it takes some time.
If that work can be done in parallel, you can run it on a separate thread to reduce the impact.
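
For example, here is a minimal sketch of moving the imread onto a background thread; the ThreadPoolExecutor usage and variable names are only illustrative and reuse runner and pad_img from the posted code:

from concurrent.futures import ThreadPoolExecutor
import time
import cv2

executor = ThreadPoolExecutor(max_workers=1)

def my_callback():
    # start the JPEG decode in the background instead of blocking the main thread
    future = executor.submit(cv2.imread, "./00041.jpg")

    astart = time.time()
    runner._infer(pad_img)
    aend = time.time()
    print("runner time is %.3f" % (aend - astart))

    # collect the decoded image once it is actually needed
    img = future.result()
    return img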

Thanks.

Thanks for your answer.

But I only measure the inference time, as in the following code:

    astart = time.time()
    runner._infer(pad_img)
    aend = time.time()

I don't think the cv2.imread should affect the inference time. Please tell me if I am wrong.

Hi,

Could you run tegrastats at the same time and share the output with us?

$ sudo tegrastats
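
If it is easier, tegrastats can also write its output to a file directly (the exact flags may vary between L4T releases):

$ sudo tegrastats --interval 1000 --logfile tegrastats.log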

Thanks.

Hi,
The following is the log with the cv2.imread line commented out:
comment_out.log (4.1 KB)

The following is the log with the cv2.imread line enabled:
use_cv2_resize.log (10.1 KB)

There has been no update from you for a period, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

Could you also add a timer before and after the imread call?
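
For example (only a sketch based on the callback posted above; the extra timer variables are illustrative):

def my_callback():
    rstart = time.time()
    cv2.imread("./00041.jpg")
    rend = time.time()

    astart = time.time()
    runner._infer(pad_img)
    aend = time.time()

    print("imread time is %.3f, runner time is %.3f" % (rend - rstart, aend - astart))
    return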

Thanks.