TensorRT inference time increase

Description

case one:

for …:
    st_time = time.time()
    result_list = yolov4_wrapper.check_person(frame)
    time_cur_detect = time.time() - st_time
    print('num = {} --------------------------------------detect_time(ms): {}'.format(num, time_cur_detect * 1000.0))

results
num = 0 --------------------------------------detect_time(ms): 86.98105812072754

num = 1210 --------------------------------------detect_time(ms): 40.47203063964844

case two:

for …:
    st_time = time.time()
    result_list = yolov4_wrapper.check_person(frame)
    time_cur_detect = time.time() - st_time
    print('num = {} --------------------------------------detect_time(ms): {}'.format(num, time_cur_detect * 1000.0))
    for i in range(10000000):
        pass

results
num = 0 --------------------------------------detect_time(ms): 85.25681495666504

num = 60 --------------------------------------detect_time(ms): 76.00593566894531

question
If the post-processing is time-consuming, the TensorRT inference time increases. How can this problem be solved?

Environment

TensorRT Version: TensorRT-7.1.3.4
GPU Type: Tesla T4
Nvidia Driver Version: 450.80.02
CUDA Version: 11.0
CUDNN Version:
Operating System + Version: Ubuntu 16.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,
Please share the model, script, profiler, and performance output, if not shared already, so that we can help you better.
Alternatively, you can try running your model with trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
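
For example (the ONNX path and flag values below are placeholders; adjust them to your model):

    trtexec --onnx=yolov4.onnx --fp16 --warmUp=500 --avgRuns=100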

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer below link for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
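
As a rough illustration, the timing in the yolov3_onnx sample could be isolated like this (a minimal sketch reusing the sample's get_engine, preprocessor, and common helpers; the warm-up and run counts are arbitrary choices):

import time

with get_engine(onnx_file_path, engine_file_path) as engine, engine.create_execution_context() as context:
    # Allocate host/device buffers once, outside the timed loop
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    image_raw, image = preprocessor.process(input_image_path)
    inputs[0].host = image

    # Warm-up runs so the GPU reaches a steady state before timing
    for _ in range(10):
        common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

    # Timed runs: only do_inference_v2 is inside the timer,
    # so pre- and post-processing do not affect the measurement
    exec_times = 100
    start = time.perf_counter()
    for _ in range(exec_times):
        trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print('pure inference time: %.5f(ms) per run' % (elapsed_ms / exec_times))

This way the reported number reflects only the TensorRT execution itself.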

Thanks!

I have tested TensorRT-7.1.3.4/samples/python/yolov3_onnx/.
Code snippet:

with get_engine(onnx_file_path, engine_file_path) as engine, engine.create_execution_context() as context:
    total_time = 0
    exec_times = 100
    for _ in range(exec_times):
        ############################################################################################################
        print('Running inference on image {}...'.format(input_image_path))
        start0 = time.time()
        image_raw, image = preprocessor.process(input_image_path)
        # Store the shape of the original input image in WH format, we will need it for later
        shape_orig_WH = image_raw.size
        # Output shapes expected by the post-processor
        output_shapes = [(1, 255, 19, 19), (1, 255, 38, 38), (1, 255, 76, 76)]
        print("===> preprocessor time(TRT): %.5f(ms)" % ((time.time() - start0) * 1000.0))
        ############################################################################################################
        start1 = time.time()
        # Do inference with TensorRT
        trt_outputs = []
        inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        # Do inference
        # Set host input to the image. The common.do_inference_v2 function will copy the input to the GPU before executing.
        inputs[0].host = image
        trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        print("===> inference time(TRT): %.5f(ms)" % ((time.time() - start1) * 1000.0))
        ############################################################################################################
        total_time += (time.time() - start0)*1000.0
        print("===> total inference time(TRT): %.5f(ms)" % ((time.time() - start0)*1000.0))
        
        # 'post for process': simulated post-processing load
        # for i in range(10000000):
        #     pass
    print('average processing time: %.5f(ms)' % (total_time / exec_times))

without 'post for process'

average processing time: 69.18464(ms)

with 'post for process'

average processing time: 98.64011(ms)

Hi @chengweige517,

We shouldn't allocate memory inside the inference loop. Please move the following line out of the loop and try again:
inputs, outputs, bindings, stream = common.allocate_buffers(engine)
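
For example, the allocation can be hoisted above the loop (a sketch based on your snippet; only the relevant lines are shown):

# Allocate host/device buffers once, before the timing loop, and reuse them
inputs, outputs, bindings, stream = common.allocate_buffers(engine)
for _ in range(exec_times):
    inputs[0].host = image
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)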

Thank you.

Hi @spolisetty ,
Thank you for your reply,


inputs, outputs, bindings, stream = common.allocate_buffers(engine)
for _ in range(exec_times):

After fixing the code, I still get the same behavior:

without 'post for process'
average processing time: 54.17452(ms)

with 'post for process'
average processing time: 82.93529(ms)

Hi @chengweige517,

Could you please share the modified scripts and model file that reproduce the issue, so that we can assist you better?

Thank you.