stream.synchronize() is slow (Python API)

Description

During inference, stream.synchronize() is very slow. Is there any way to get rid of it?

Environment

TensorRT Version: 8.0.0.3
GPU Type: T4
Nvidia Driver Version: 450
CUDA Version: 11.0
CUDNN Version: 8.2.0
Operating System + Version: CentOS 7
Python Version (if applicable): 3.7.19
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

import pycuda.driver as cuda  # assumes pycuda.autoinit (or an equivalent CUDA context) elsewhere

def do_inference(context, bindings, inputs, outputs, stream, batch_size):
    # Transfer input data from host (CPU) to device (GPU) asynchronously.
    for inp in inputs:
        cuda.memcpy_htod_async(inp.device, inp.host, stream)
    # Launch inference asynchronously on the given stream.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from device to host asynchronously.
    for out in outputs:
        cuda.memcpy_dtoh_async(out.host, out.device, stream)
    # Block until all work enqueued on the stream has completed.
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

Data transfer between CPU and GPU and the execution step both look fast;
stream.synchronize() takes almost 90% of the time spent in this function.

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,
Could you share the model, script, and profiler/performance output, if not already shared, so that we can help you better?
Alternatively, you can try running your model with trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
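
For example, a benchmark invocation along these lines (the input binding name "input" is an assumption; substitute your engine's actual input name):

trtexec --loadEngine=test.trt --shapes=input:450x3x394x224 --warmUp=500 --avgRuns=100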

While measuring model performance, make sure you measure the latency and throughput of the network inference itself, excluding data pre- and post-processing overhead.
Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
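
For example, a minimal sketch of timing the GPU work with CUDA events via PyCUDA (reusing the stream and do_inference from your snippet; the event timing excludes host-side Python overhead):

import pycuda.driver as cuda

# Bracket the inference with events recorded on the same stream.
start, end = cuda.Event(), cuda.Event()
start.record(stream)
do_inference(context, bindings, inputs, outputs, stream, batch_size)
end.record(stream)
end.synchronize()  # wait until the end event (and all prior stream work) has completed
print("GPU time: %.3f ms" % start.time_till(end))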

Thanks!

@NVES Hi,

Please find below my trt engine.
test.trt (20.7 MB)

It is a dynamic-input model. Input shapes range from (1, 3, 100, 100) to (500, 3, 410, 224), and the output shape is (batch_size, 2).
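
For context, a dynamic-shape engine needs the concrete input shape set on the execution context before each launch. A minimal sketch of that path (it uses set_binding_shape and execute_async_v2 rather than the implicit-batch execute_async; treating the input as binding index 0 is an assumption):

# Set the concrete input shape for this request (input assumed at binding index 0).
context.set_binding_shape(0, (450, 3, 394, 224))
assert context.all_binding_shapes_specified
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)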

When I tested a (450, 3, 394, 224) input on my machine, the synchronize step took ~0.66 s while the execution step only took ~0.001 s.

Thanks.

Hi @751180903,

This looks CUDA-stream related. We recommend posting your concern on the CUDA forum to get better help.

Thank you.

But it’s in your TensorRT demo code; don’t make the CUDA part your scapegoat. Could you explain why this line of code is here?

Hi @ycy1164656
By specifying a stream, CUDA API calls become asynchronous, meaning a call may return before the command has completed. Both memory transfers and kernel launches can be issued on a CUDA stream.
stream.synchronize() blocks the host until all work previously enqueued on that stream has completed. That is why the execution step appears to take only ~0.001 s: execute_async returns as soon as the work is launched, and the actual inference time is then absorbed by stream.synchronize(), which waits for it to finish. The call itself is not slow; it is where the asynchronous work completes.
The sample code is generalized for multiple inputs/outputs and is for reference only; you can modify it to use synchronous calls based on your use case.
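
For example, a minimal sketch of per-step timing, reusing the names from the snippet above. Synchronizing after each step makes each measurement reflect completed GPU work; without the extra synchronizes, the "execution" time is only the kernel launch latency:

import time
import pycuda.driver as cuda

t0 = time.perf_counter()
for inp in inputs:
    cuda.memcpy_htod_async(inp.device, inp.host, stream)
stream.synchronize()  # wait for host-to-device copies
t1 = time.perf_counter()

context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
stream.synchronize()  # wait for the kernels: this is the true inference time
t2 = time.perf_counter()

for out in outputs:
    cuda.memcpy_dtoh_async(out.host, out.device, stream)
stream.synchronize()  # wait for device-to-host copies
t3 = time.perf_counter()

print("H2D %.3fs, inference %.3fs, D2H %.3fs" % (t1 - t0, t2 - t1, t3 - t2))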

Thanks