TensorRT data copies cost a lot of time

Description

I converted a ResNet v1_50 model to TensorRT and run it in INT8 precision. With batch_size 16, inference alone costs 10.73 ms/batch; however, when I add copyInputToDevice() and copyOutputToHost(), it costs 14.88 ms/batch, while the TF-TRT model costs 13.08 ms/batch (data transfer included).
I also tried copyInputToDeviceAsync() and copyOutputToHostAsync() and ran the model with context->enqueue(), but the time does not go down.
Is there any way to reduce the data transfer time? Thank you very much!

Environment

TensorRT Version: 7.0
GPU Type: T4
Nvidia Driver Version: 410.79
CUDA Version: 10.0
CUDNN Version: 7.6.4
Operating System + Version: CentOS 7
Python Version (if applicable): 2.7
TensorFlow Version (if applicable): 1.15

Code

const auto t_start = std::chrono::high_resolution_clock::now();

// Create CUDA stream for the execution of this inference
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

// Asynchronously copy data from host input buffers to device input buffers
buffers.copyInputToDeviceAsync(stream);

// Asynchronously enqueue the inference work
if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
{
    return false;
}

// Asynchronously copy data from device output buffers to host output buffers
buffers.copyOutputToHostAsync(stream);

// Wait for the work in the stream to complete
cudaStreamSynchronize(stream);

// Release stream
cudaStreamDestroy(stream);

const auto t_end = std::chrono::high_resolution_clock::now();

I wonder whether the code above can reduce the data transfer time. Is this code correct?

Hi,

You need to use pinned (page-locked) host memory allocated with cudaMallocHost instead of regular pageable memory, and use it together with a CUDA stream to optimize the data transfer.
Also, scheduling requests in separate streams allows work to be scheduled immediately as the hardware becomes available without unnecessary synchronization.
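As a rough sketch (the names hostInput, hostOutput, deviceInput, deviceOutput, inputBytes, outputBytes, bindings, batchSize, and the CHECK macro are placeholders standing in for your own engine setup, not something from your code), allocating pinned host buffers with cudaMallocHost and issuing the copies with cudaMemcpyAsync on the same stream as the enqueue call could look like this:

#include <cuda_runtime.h>

// Allocate pinned (page-locked) host buffers; cudaMemcpyAsync can only
// overlap with other work when the host side is page-locked.
float* hostInput = nullptr;
float* hostOutput = nullptr;
CHECK(cudaMallocHost(&hostInput, inputBytes));
CHECK(cudaMallocHost(&hostOutput, outputBytes));

cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

// ... fill hostInput with the preprocessed batch ...

// Host-to-device copy, inference, and device-to-host copy issued on one stream
CHECK(cudaMemcpyAsync(deviceInput, hostInput, inputBytes,
                      cudaMemcpyHostToDevice, stream));
context->enqueue(batchSize, bindings, stream, nullptr);
CHECK(cudaMemcpyAsync(hostOutput, deviceOutput, outputBytes,
                      cudaMemcpyDeviceToHost, stream));
CHECK(cudaStreamSynchronize(stream));

// Pinned memory must be released with cudaFreeHost, not free()/delete
CHECK(cudaFreeHost(hostInput));
CHECK(cudaFreeHost(hostOutput));
CHECK(cudaStreamDestroy(stream));

With this pattern you can also create the stream and pinned buffers once at startup and reuse them for every batch, so only the copies and the enqueue remain in the per-batch timing loop.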

Please refer to the documentation and sample below:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#optimize-performance
https://github.com/NVIDIA/TensorRT/blob/07ed9b57b1ff7c24664388e5564b17f7ce2873e5/samples/opensource/sampleMovieLensMPS/sampleMovieLensMPS.cpp

Thanks