Synchronous inference or asynchronous inference

I'm new to CUDA programming and also new to parallel computing. For inference execution in TensorRT, there are two ways: one is enqueue, which executes asynchronously, and the other is execute, which executes synchronously. Does that mean that if I use enqueue to run inference on a batch of images (say 8) like below:

// buffers[inputIndex] holds the batched image data on the device
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, 8 * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
context.enqueue(8, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], 8 * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
CHECK(cudaStreamSynchronize(stream)); // wait before reading `output` on the host

During execution, is each image inferenced one by one, 8 times in sequence?

And if I use execute to run inference on them like this:

CHECK(cudaMemcpyAsync(buffers[inputIndex], input, 8 * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
CHECK(cudaStreamSynchronize(stream)); // execute() does not know about `stream`, so finish the input copy first
context.execute(8, buffers);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], 8 * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
CHECK(cudaStreamSynchronize(stream)); // make sure `output` is valid on the host

Are all 8 images inferenced at the same time?

So what are the pros and cons of doing inference asynchronously? It seems like doing inference synchronously would obviously be faster than asynchronously, right?

Hello,

Asynchronous inference execution generally increases performance because it lets compute overlap with data transfers and host-side work, which maximizes GPU utilization.

The pro of the execute/sync API is simplicity: the code executes inline. The challenge of the enqueue/async API is the extra work of handling the signaling events that tell you when buffers are in use or can be freed. But you should see more efficient GPU usage with the async model.
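Just to illustrate where that extra efficiency comes from, here is a minimal double-buffering sketch (not from the docs). It assumes an already-built engine pointer, the INPUT_*/OUTPUT_SIZE constants and CHECK macro from your snippet, that binding 0 is the input and binding 1 the output, and hypothetical fillInput()/consumeOutput() helpers plus a numBatches count:

// Double-buffering sketch: two streams, each with its own execution context.
constexpr int kBatch = 8;
constexpr int kNumStreams = 2;
const size_t inBytes  = kBatch * INPUT_C * INPUT_H * INPUT_W * sizeof(float);
const size_t outBytes = kBatch * OUTPUT_SIZE * sizeof(float);

nvinfer1::IExecutionContext* contexts[kNumStreams];
cudaStream_t streams[kNumStreams];
float *dIn[kNumStreams], *dOut[kNumStreams], *hIn[kNumStreams], *hOut[kNumStreams];

for (int i = 0; i < kNumStreams; ++i)
{
    contexts[i] = engine->createExecutionContext();          // one context per stream
    CHECK(cudaStreamCreate(&streams[i]));
    CHECK(cudaMalloc((void**)&dIn[i], inBytes));
    CHECK(cudaMalloc((void**)&dOut[i], outBytes));
    CHECK(cudaMallocHost((void**)&hIn[i], inBytes));         // pinned memory for true async copies
    CHECK(cudaMallocHost((void**)&hOut[i], outBytes));
}

for (int b = 0; b < numBatches; ++b)
{
    const int s = b % kNumStreams;                            // alternate between the two streams
    CHECK(cudaStreamSynchronize(streams[s]));                 // the earlier batch on this stream is done
    if (b >= kNumStreams)
        consumeOutput(hOut[s], b - kNumStreams);              // hypothetical: use the finished results

    fillInput(hIn[s], b);                                     // hypothetical: host-side preprocessing
    void* bindings[] = { dIn[s], dOut[s] };                   // assumes binding 0 = input, 1 = output
    CHECK(cudaMemcpyAsync(dIn[s], hIn[s], inBytes, cudaMemcpyHostToDevice, streams[s]));
    contexts[s]->enqueue(kBatch, bindings, streams[s], nullptr);
    CHECK(cudaMemcpyAsync(hOut[s], dOut[s], outBytes, cudaMemcpyDeviceToHost, streams[s]));
    // enqueue() returns immediately, so while this batch runs on stream s the host
    // loops around and prepares the next batch for the other stream.
}

for (int i = 0; i < kNumStreams; ++i)
    CHECK(cudaStreamSynchronize(streams[i]));
// ...consume the remaining hOut[] results and free the buffers/contexts here...

While the GPU is busy with the batch queued on one stream, the host prepares and copies the next batch on the other stream, so transfers and compute overlap instead of serializing.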

In a typical use case, TensorRT will execute asynchronously. The enqueue() method will add kernels to a CUDA stream specified by the application. One of the enqueue() parameters is an optional cudaEvent_t that will be signaled when the input buffers are no longer in use and can be refilled.
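As a rough illustration of that parameter (reusing the buffers, stream, and size constants from the question; the event name here is made up), it could be used like this:

cudaEvent_t inputConsumed;
CHECK(cudaEventCreate(&inputConsumed));

CHECK(cudaMemcpyAsync(buffers[inputIndex], input, 8 * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
// Pass the event as the last argument; TensorRT records it once the input
// binding has been read and may safely be overwritten.
context.enqueue(8, buffers, stream, &inputConsumed);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], 8 * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));

// Wait only for the "input consumed" signal, not for the whole inference.
CHECK(cudaEventSynchronize(inputConsumed));
// At this point the next batch can be staged into buffers[inputIndex]
// (e.g. via a copy on a second stream) while the current inference finishes.

CHECK(cudaStreamSynchronize(stream));   // `output` is only valid on the host after this
CHECK(cudaEventDestroy(inputConsumed));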

You can find more information about the enqueue function at:
http://docs.nvidia.com/deeplearning/sdk/tensorrt-user-guide/index.html#doinference