During inference, stream.synchronize() is very slow. Is there any approach to get rid of it?
Environment
TensorRT Version: 8.0.0.3
GPU Type: T4
Nvidia Driver Version: 450
CUDA Version: 11.0
CUDNN Version: 8.2.0
Operating System + Version: CentOS 7
Python Version (if applicable): 3.7.19
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Steps To Reproduce
import pycuda.driver as cuda

def do_inference(context, bindings, inputs, outputs, stream, batch_size):
    # Transfer input data from pagelocked host memory to the GPU (asynchronous).
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Enqueue inference on the stream (returns immediately).
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU to the host (asynchronous).
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Block until all work queued on the stream has finished.
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
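For context, here is a sketch of how the buffers that feed this helper are typically set up, following the standard TensorRT Python samples (the HostDeviceMem and allocate_buffers names come from those samples, not from this post):

import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class HostDeviceMem:
    # Pairs a pagelocked host buffer with its device allocation.
    def __init__(self, host, device):
        self.host = host
        self.device = device

def allocate_buffers(engine):
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)  # pinned memory so async copies are truly async
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream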
Data transfer between the CPU and GPU and the execution itself are both fine.
stream.synchronize() takes almost 90% of the time in this function.
Hi @ycy1164656
By specifying a stream, the CUDA API calls become asynchronous, meaning each call may return before the command has actually completed. Both memory-transfer calls and kernel launches can be issued on a CUDA stream. stream.synchronize() then blocks until all work queued on that stream has finished, so the copy and inference time that the earlier calls only enqueued shows up at the synchronization point rather than in the calls themselves.
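For example, a minimal sketch with CUDA events (reusing the context, bindings, inputs, outputs, and stream objects from your snippet; the event-based timing is only an illustration, not part of your code) shows that the wait in synchronize() is really the GPU finishing the enqueued work:

import pycuda.driver as cuda

start, end = cuda.Event(), cuda.Event()

start.record(stream)   # timestamp enqueued before the copies and the kernel
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
end.record(stream)     # timestamp enqueued after the copies and the kernel

stream.synchronize()   # the CPU only waits here; the GPU has been working all along
print("GPU time (ms):", start.time_till(end))  # roughly the time "spent" in synchronize()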
The sample code is generalized for multiple inputs/outputs and is just for reference. You can modify it to run fully synchronously based on your use case.
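For instance, a sketch of a fully synchronous variant (do_inference_sync is just an illustrative name; it assumes the same inputs/outputs objects as above) could look like:

def do_inference_sync(context, bindings, inputs, outputs, batch_size):
    # Blocking host-to-device copies.
    for inp in inputs:
        cuda.memcpy_htod(inp.device, inp.host)
    # Blocking execution on the default stream.
    context.execute(batch_size=batch_size, bindings=bindings)
    # Blocking device-to-host copies.
    for out in outputs:
        cuda.memcpy_dtoh(out.host, out.device)
    return [out.host for out in outputs]

Note that this does not make inference faster; the same copy and kernel time simply shows up inside each blocking call instead of inside stream.synchronize().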