Just adding a `cv2.imread` call makes the inference time increase by about 80%, even though the imread result is not used

ENVIRONMENT:
Jetson Nano 4GB
Docker image: nvcr.io/nvidia/l4t-ml:r32.7.1-py3

I will give the complete code below; here is my callback code:

def my_callback():
    # cv2.imread("./00041.jpg")
    astart = time.time()
    runner._infer(pad_img)
    aend = time.time()
    
    runner_time = aend - astart
    print("runner time is %.3f"%(runner_time))
    return 

With the cv2.imread line commented out as above, the runner time is 0.014; when the cv2.imread("./00041.jpg") line is enabled, the runner time becomes 0.026. It is really weird; I hope someone can help me figure it out.

Here is the complete code:

import time
import numpy as np
import cv2


import tensorrt as trt
import pycuda.autoinit
import pycuda.driver as cuda


# Stub that simply invokes the callback a fixed number of times
class Team:
    def __init__(self):
        pass

        
    def run(self, callback):
        for _ in range(400):
            callback()
       

team = Team()

class RUNNER(object):
    def __init__(self, engine, batch_size):
        #cuda.init()

        logger = trt.Logger(trt.Logger.WARNING)
        logger.min_severity = trt.Logger.Severity.ERROR
        trt.init_libnvinfer_plugins(logger,'')
        
        self.batch_size = batch_size
        self.context = engine.create_execution_context()
        self.imgsz = engine.get_binding_shape(0)[2:]
        self.inputs, self.outputs, self.bindings = [], [], []
        self.stream = cuda.Stream()

        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding))
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            if engine.binding_is_input(binding):
                self.inp_size = size 
                host_mem = cuda.pagelocked_empty(size, dtype)
                device_mem = cuda.mem_alloc(host_mem.nbytes)
                self.bindings.append(int(device_mem))
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                host_mem = cuda.pagelocked_empty(size, dtype)
                device_mem = cuda.mem_alloc(host_mem.nbytes)
                self.bindings.append(int(device_mem))
                self.outputs.append({'host': host_mem, 'device': device_mem})



    def _infer(self, img):
        
        infer_num = img.shape[0]
        # pad the input if the last batch is smaller than batch_size
        img_flatten = np.ravel(img)
        pad_zeros = np.zeros(self.inp_size - img_flatten.shape[0], dtype=np.float32)
        img_inp = np.concatenate([img_flatten, pad_zeros], axis=0)
        self.inputs[0]['host'] = img_inp

        for inp in self.inputs:
            cuda.memcpy_htod(inp['device'], inp['host'])

        # run inference
        self.context.execute_v2(
            bindings=self.bindings,
            )

        # fetch outputs from the GPU
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)

        # synchronize the stream so the host buffers are valid before reading them
        self.stream.synchronize()
        data = [out['host'] for out in self.outputs]
        return infer_num, data


def _get_engine(engine_path):
    logger = trt.Logger(trt.Logger.WARNING)
    logger.min_severity = trt.Logger.Severity.ERROR
    runtime = trt.Runtime(logger)
    trt.init_libnvinfer_plugins(logger,'') # initialize TensorRT plugins
    with open(engine_path, "rb") as f:
        serialized_engine = f.read()
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    return engine

batch_size = 1
engine_path = './test.batch1.fp16.trt'
engine = _get_engine(engine_path)
runner = RUNNER(engine, batch_size)

pad_img = np.load('./data_640x384_batch1_nonorm.npy')

def my_callback():
    cv2.imread("./00041.jpg")
    astart = time.time()
    runner._infer(pad_img)
    aend = time.time()
    
    runner_time = aend - astart
    print("runner time is %.3f"%(runner_time))
    return 

team.run(my_callback)

Hi,

The CPU runs code sequentially.
imread involves file reading and JPEG decoding, so it takes some time.
If that work can be done in parallel, you can run it on a separate thread to reduce the impact.
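
For example, here is a minimal sketch of moving the imread onto a background thread; the ThreadPoolExecutor usage and variable names are only illustrative and reuse runner and pad_img from the posted code:

from concurrent.futures import ThreadPoolExecutor
import time
import cv2

executor = ThreadPoolExecutor(max_workers=1)

def my_callback():
    # start the JPEG decode in the background instead of blocking the main thread
    future = executor.submit(cv2.imread, "./00041.jpg")

    astart = time.time()
    runner._infer(pad_img)
    aend = time.time()
    print("runner time is %.3f" % (aend - astart))

    # collect the decoded image once it is actually needed
    img = future.result()
    return img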

Thanks.

Thanks for your answer.

But I only measure the inference time, as in the following code:

    astart = time.time()
    runner._infer(pad_img)
    aend = time.time()

I don't think the cv2.imread should affect the inference time. Please tell me if I am wrong.

Hi,

Could you run tegrastats at the same time and share the output with us?

$ sudo tegrastats
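
If it is easier, tegrastats can also write its output to a file directly (the exact flags may vary between L4T releases):

$ sudo tegrastats --interval 1000 --logfile tegrastats.log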

Thanks.

Hi,
The following is the log with the cv2.imread line commented out:
comment_out.log (4.1 KB)

The following is the log with the cv2.imread line enabled:
use_cv2_resize.log (10.1 KB)

There has been no update from you for a period, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

Could you also add a timer before and after the imread call?
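
For example (only a sketch based on the callback posted above; the extra timer variables are illustrative):

def my_callback():
    rstart = time.time()
    cv2.imread("./00041.jpg")
    rend = time.time()

    astart = time.time()
    runner._infer(pad_img)
    aend = time.time()

    print("imread time is %.3f, runner time is %.3f" % (rend - rstart, aend - astart))
    return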

Thanks.