About sampleUffSSD.cpp: Why doesn't IExecutionContext::execute need a cudaStreamSynchronize before it?

In doInference in sampleUffSSD.cpp, we have:

// DMA the input to the GPU,  execute the batch asynchronously, and DMA it back:
CHECK(cudaMemcpyAsync(buffers[inputIndex], inputData, batchSize * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));

auto t_start = std::chrono::high_resolution_clock::now();
context.execute(batchSize, &buffers[0]);
auto t_end = std::chrono::high_resolution_clock::now();

How does the code ensure that the copying to buffers[inputIndex] has completed before IExecutionContext::execute executes?

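As the CUDA C Programming Guide explains (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#implicit-synchronization), “any CUDA command to the NULL stream” causes implicit synchronization with the CUDA work on other streams.
context.execute(batchSize, &buffers[0]) is the synchronous variant and runs on the NULL stream, so it does not start until the cudaMemcpyAsync issued earlier on the other stream has completed.
Note that this holds only for blocking streams, such as those created with cudaStreamCreate. A stream created with the cudaStreamNonBlocking flag does not synchronize with the NULL stream, and in that case an explicit cudaStreamSynchronize is needed.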