Understanding stream synchronization with TRT Inference


I need help understanding the behavior of stream synchronization. I have the following code I use to run inference with an image classifier on frames captured from a camera in real-time.

  def run_inference(self, context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream.
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

However, if I remove the stream.synchronize() call, my throughput roughly doubles. Without synchronization, calling stream.is_done() right after execute_async returns False, which I expect. I believe the speed-up is so significant because the GPU and CPU can then work concurrently, and in my use case the CPU has to do a lot of image preprocessing before inputs are ready for inference.

What is happening under the hood when the synchronize call is absent? I’m guessing that following the execute_async call, my outputs haven’t yet been updated to reflect the most recent input, since the GPU is still processing asynchronously while the CPU runs ahead. I assume the output buffer simply holds the result of the most recently completed inference. So if I add a print statement to view the output, it does not reflect the most recent frame queued for inference? I am re-using the same input/output buffers.

Are there any other drawbacks to removing the synchronize call that I’m missing? Is my understanding flawed?

Edit - Also, should I be making new buffers and a new stream per inference, or is it ok to re-use these buffers and stream?

Edit 2 - I’ve decided at least for now to call stream.synchronize() just before new input is loaded onto device memory (which is before execute_async). My throughput is lower but still higher than if stream.synchronize() is called after execute_async. I believe this ensures synchronization with the added benefit that image preprocessing on the CPU and inference on the GPU are concurrent. That is, while the GPU is running inference on the i-th input, the CPU is preprocessing the (i+1)-th input.
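The Edit 2 scheme can be sketched without any GPU at all. Below is a pure-Python analogy (not TensorRT/PyCUDA code): a single-worker thread pool stands in for the CUDA stream, `submit()` plays the role of `execute_async`, and `Future.result()` plays the role of `stream.synchronize()`. The `preprocess`/`infer` functions and their timings are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def preprocess(frame):
    """Stand-in for CPU-side image preprocessing (hypothetical workload)."""
    time.sleep(0.01)
    return frame * 2

def infer(x):
    """Stand-in for GPU inference running on the stream (hypothetical workload)."""
    time.sleep(0.01)
    return x + 1

frames = list(range(5))
results = []

# The single worker thread models one CUDA stream: work submitted to it
# runs in order, concurrently with the main (CPU) thread.
with ThreadPoolExecutor(max_workers=1) as gpu_stream:
    pending = None
    for frame in frames:
        x = preprocess(frame)                  # CPU preprocesses frame i while
        if pending is not None:                # "inference" on frame i-1 is in flight,
            results.append(pending.result())   # then waits (~ stream.synchronize())
        pending = gpu_stream.submit(infer, x)  # launch frame i (~ execute_async)
    results.append(pending.result())           # drain the last in-flight frame

print(results)  # [1, 3, 5, 7, 9]
```

Because each `preprocess` call runs on the main thread while the previous `infer` runs on the worker, the CPU and the "GPU" overlap exactly as described above, and results still come back in order.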


stream.synchronize() holds the CPU until all work queued on the stream, including the memory copies, has finished.
Without calling it, you cannot be sure the output buffer is ready, even though performance will improve.

You don’t need to call the synchronization right after the kernel call.
Some independent, CPU-only work can be inserted between the kernel launch and the synchronization.
Just make sure the buffer is ready (via stream.synchronize()) before accessing it from the CPU.
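This rule can be illustrated with a minimal plain-Python sketch (no GPU involved): a single-worker pool again plays the CUDA stream, `submit()` plays the asynchronous kernel launch, and `result()` plays `stream.synchronize()`. The `fake_kernel` function and its timing are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel(x):
    """Stand-in for asynchronous GPU work (hypothetical)."""
    time.sleep(0.02)
    return x * x

with ThreadPoolExecutor(max_workers=1) as stream:
    fut = stream.submit(fake_kernel, 7)  # like execute_async: returns immediately
    # Independent, CPU-only work can safely run here, overlapping the "kernel"...
    cpu_work = sum(range(100))
    # ...but the result must not be read until we synchronize.
    out = fut.result()                   # like stream.synchronize()

print(out)  # 49
```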