I'm new to CUDA programming and also new to parallel computing. In TensorRT there are two ways to run inference: enqueue, which executes asynchronously, and execute, which executes synchronously. Does that mean that if I use enqueue to run inference on a batch of images (say 8), like below:
// buffers[inputIndex] holds the batched input image data on the device
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, 8 * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
context.enqueue(8, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], 8 * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream); // wait for the copies and the inference to finish before reading output on the host
then during execution are the 8 images inferred one by one, in sequence?
And if I use execute to run inference on them like this:
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, 8 * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
cudaStreamSynchronize(stream); // make sure the async input copy has finished before the synchronous execute
context.execute(8, buffers);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], 8 * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream); // wait for the output copy before reading output on the host
then all 8 images are inferred at the same time?
So what are the pros and cons of doing inference asynchronously, since doing it synchronously seems like it would obviously be faster, right?
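For example, is the advantage of enqueue that I could pipeline two batches on two streams, so the second batch's copies and kernels overlap with the first batch's? Here is a rough, untested sketch of what I have in mind (context0/context1, buffers0/buffers1, input0/input1, output0/output1 are just placeholder names for two execution contexts and their binding/host buffers):

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

// Batch 0 on stream0: copy in, enqueue inference, copy out -- all asynchronous
CHECK(cudaMemcpyAsync(buffers0[inputIndex], input0, 8 * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream0));
context0.enqueue(8, buffers0, stream0, nullptr);
CHECK(cudaMemcpyAsync(output0, buffers0[outputIndex], 8 * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream0));

// Batch 1 on stream1: issued immediately, without waiting for batch 0
CHECK(cudaMemcpyAsync(buffers1[inputIndex], input1, 8 * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream1));
context1.enqueue(8, buffers1, stream1, nullptr);
CHECK(cudaMemcpyAsync(output1, buffers1[outputIndex], 8 * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream1));

// The host only blocks here, after both batches have been queued
cudaStreamSynchronize(stream0);
cudaStreamSynchronize(stream1);

Is that the kind of use case enqueue is meant for, or am I misunderstanding it?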