I need help understanding the behavior of stream synchronization. I have the following code, which I use to run inference with an image classifier on frames captured from a camera in real time.
```python
def run_inference(self, context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(
        batch_size=batch_size, bindings=bindings, stream_handle=stream.handle
    )
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream.
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
```
However, I can remove the call to `stream.synchronize()` and my throughput roughly doubles. Without synchronization, calling `execute_async` returns `False`, which I expect. I believe the speed-up is so significant because the GPU and CPU are able to work concurrently, and in my use case the CPU has to do a lot of image preprocessing before inputs are ready for inference.
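That reasoning can be sanity-checked with a host-side analogy (plain threads standing in for the GPU stream; this is not PyCUDA or TensorRT code). Waiting on each job right after enqueueing it serializes CPU and "GPU" work, while deferring the wait lets them overlap, roughly halving the total time when the two phases take similar time:

```python
import threading
import time


def gpu_infer(done_event):
    """Stand-in for GPU inference: takes a while, then signals completion."""
    time.sleep(0.05)
    done_event.set()


def cpu_preprocess():
    """Stand-in for CPU-side image preprocessing."""
    time.sleep(0.05)


def run(n_frames, sync_each_frame):
    start = time.perf_counter()
    pending = None
    for _ in range(n_frames):
        cpu_preprocess()
        done = threading.Event()
        threading.Thread(target=gpu_infer, args=(done,)).start()
        if sync_each_frame:
            done.wait()      # like stream.synchronize() after execute_async
        else:
            pending = done   # fire and forget; the CPU moves on immediately
    if pending is not None:
        pending.wait()       # drain the last in-flight job
    return time.perf_counter() - start


serialized = run(8, sync_each_frame=True)   # CPU and GPU time add up
overlapped = run(8, sync_each_frame=False)  # CPU and GPU work concurrently
print(f"serialized: {serialized:.2f}s, overlapped: {overlapped:.2f}s")
```

With equal per-frame CPU and GPU costs, the overlapped schedule finishes in roughly half the serialized time, matching the throughput jump you observed.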
What is happening under the hood with the call to `synchronize` absent? I'm guessing that following the `execute_async` call, my outputs aren't updated to reflect the most recent input data, since the GPU is still processing asynchronously while the CPU is running code. I assume the output is just the output of the most recently completed inference. So if I have a print statement to view the output, the output does not reflect the most recent frame queued for inference? I am re-using the same input/output buffers.
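That intuition about stale outputs can be illustrated with another host-side sketch (threads standing in for the async stream; none of this is TensorRT API). If the reader doesn't wait for the worker before reading a shared, re-used output buffer, it can observe the result of an earlier inference:

```python
import threading
import time

output_buffer = [None]  # re-used output buffer, like out.host


def async_infer(frame_id, done):
    """Stand-in for the GPU: takes a while, then writes into the shared buffer."""
    time.sleep(0.05)
    output_buffer[0] = f"result-for-frame-{frame_id}"
    done.set()


# Frame 0: enqueue and (correctly) wait before reading.
done0 = threading.Event()
threading.Thread(target=async_infer, args=(0, done0)).start()
done0.wait()                 # like stream.synchronize()
print(output_buffer[0])      # result-for-frame-0

# Frame 1: enqueue but read immediately, without waiting.
done1 = threading.Event()
threading.Thread(target=async_infer, args=(1, done1)).start()
stale = output_buffer[0]     # still the previous frame's result
print(stale)                 # result-for-frame-0 (stale!)
done1.wait()
print(output_buffer[0])      # now result-for-frame-1
```

This mirrors the failure mode you describe: with re-used buffers and no synchronization, a read of `out.host` right after `execute_async` may still hold a previous frame's predictions (or, worse, a partially written mix, since nothing orders the read against the in-flight device-to-host copy).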
Is there any other drawback to removing the synchronize call that I'm missing? Is my understanding flawed?
Edit - Also, should I be making new buffers and a new stream per inference, or is it ok to re-use these buffers and stream?
Edit 2 - I've decided, at least for now, to call `stream.synchronize()` just before new input is loaded onto device memory (which is before `execute_async`). My throughput is lower than with no synchronization, but still higher than if `stream.synchronize()` is called after `execute_async`. I believe this ensures synchronization, with the added benefit that image preprocessing on the CPU and inference on the GPU are concurrent. That is, while the GPU is running inference on the i-th input, the CPU is preprocessing the (i+1)-th input.