stream.synchronize() is slow (Python API)

Description

During inference, stream.synchronize() is very slow. Is there any way to get rid of it?

Environment

TensorRT Version: 8.0.0.3
GPU Type: T4
Nvidia Driver Version: 450
CUDA Version: 11.0
CUDNN Version: 8.2.0
Operating System + Version: CentOS 7
Python Version (if applicable): 3.7.19
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

import pycuda.driver as cuda  # assumes pycuda.autoinit (or an equivalent CUDA context) elsewhere

def do_inference(context, bindings, inputs, outputs, stream, batch_size):
    # Transfer input data from host (CPU) to device (GPU) asynchronously.
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Launch inference asynchronously on the given stream.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from device to host asynchronously.
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Block until all work enqueued on the stream has completed.
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

Data transfer between CPU and GPU and the execution step both look fast;
stream.synchronize() takes almost 90% of the time spent in this function.

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,
Could you share the model, script, and profiler/performance output, if not already shared, so that we can help you better?
Alternatively, you can try running your model with trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
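
For example, a benchmark invocation along these lines (the input binding name "input" is an assumption; substitute your engine's actual input name):

trtexec --loadEngine=test.trt --shapes=input:450x3x394x224 --warmUp=500 --avgRuns=100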

While measuring model performance, make sure you measure the latency and throughput of the network inference itself, excluding data pre- and post-processing overhead.
Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
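
For example, a minimal sketch of timing the GPU work with CUDA events via PyCUDA (reusing the stream and do_inference from your snippet; the event timing excludes host-side Python overhead):

import pycuda.driver as cuda

# Bracket the inference with events recorded on the same stream.
start, end = cuda.Event(), cuda.Event()
start.record(stream)
do_inference(context, bindings, inputs, outputs, stream, batch_size)
end.record(stream)
end.synchronize()  # wait until the end event (and all prior stream work) has completed
print("GPU time: %.3f ms" % start.time_till(end))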

Thanks!

@NVES Hi,

Please find below my trt engine.
test.trt (20.7 MB)

It is a dynamic-input model. Input shapes range from (1, 3, 100, 100) to (500, 3, 410, 224), and the output shape is (batch_size, 2).
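
For context, a dynamic-shape engine needs the concrete input shape set on the execution context before each launch. A minimal sketch of that path (it uses set_binding_shape and execute_async_v2 rather than the implicit-batch execute_async; treating the input as binding index 0 is an assumption):

# Set the concrete input shape for this request (input assumed at binding index 0).
context.set_binding_shape(0, (450, 3, 394, 224))
assert context.all_binding_shapes_specified
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)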

When I tested a (450, 3, 394, 224) input on my machine, the synchronize step took ~0.66 s while the execution step only took ~0.001 s.

Thanks.

Hi @751180903,

This looks CUDA-stream related. We recommend posting your concern on the CUDA forum to get better help.

Thank you.

But it’s in your TensorRT demo code; don’t make the CUDA part your scapegoat. Could you explain why this line of code is here?

Hi @ycy1164656
By specifying a stream, CUDA API calls become asynchronous, meaning a call may return before the command has completed. Both memory transfers and kernel launches can be issued on a CUDA stream.
stream.synchronize() blocks the host until all work previously enqueued on that stream has completed. That is why the execution step appears to take only ~0.001 s: execute_async returns as soon as the work is launched, and the actual inference time is then absorbed by stream.synchronize(), which waits for it to finish. The call itself is not slow; it is where the asynchronous work completes.
The sample code is generalized for multiple inputs/outputs and is for reference only; you can modify it to use synchronous calls based on your use case.
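
For example, a minimal sketch of per-step timing, reusing the names from the snippet above. Synchronizing after each step makes each measurement reflect completed GPU work; without the extra synchronizes, the "execution" time is only the kernel launch latency:

import time
import pycuda.driver as cuda

t0 = time.perf_counter()
for inp in inputs:
    cuda.memcpy_htod_async(inp.device, inp.host, stream)
stream.synchronize()  # wait for host-to-device copies
t1 = time.perf_counter()

context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
stream.synchronize()  # wait for the kernels: this is the true inference time
t2 = time.perf_counter()

for out in outputs:
    cuda.memcpy_dtoh_async(out.host, out.device, stream)
stream.synchronize()  # wait for device-to-host copies
t3 = time.perf_counter()

print("H2D %.3fs, inference %.3fs, D2H %.3fs" % (t1 - t0, t2 - t1, t3 - t2))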

Thanks