TensorRT: copying data costs a lot of time


I converted a ResNet v1_50 model to TensorRT and run it in INT8 precision. With a batch_size of 16, inference alone costs 10.73 ms/batch. However, after adding copyInputToDevice() and copyOutputToHost(), it costs 14.88 ms/batch, while the TF-TRT model costs 13.08 ms/batch (data transport included).
I also tried copyInputToDeviceAsync() and copyOutputToHostAsync() and ran the model with context->enqueue(), but the time cost did not decrease.
Is there any way to reduce the data transport time? Thank you very much!


TensorRT Version: 7.0
GPU Type: T4
Nvidia Driver Version: 410.79
CUDA Version: 10.0
CUDNN Version: 7.6.4
Operating System + Version: CentOS 7
Python Version (if applicable): 2.7
TensorFlow Version (if applicable): 1.15


const auto t_start = std::chrono::high_resolution_clock::now();

// Create CUDA stream for the execution of this inference
cudaStream_t stream;
cudaStreamCreate(&stream);

// Asynchronously copy data from host input buffers to device input buffers
buffers.copyInputToDeviceAsync(stream);

// Asynchronously enqueue the inference work
if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
    return false;

// Asynchronously copy data from device output buffers to host output buffers
buffers.copyOutputToHostAsync(stream);

// Wait for the work in the stream to complete
cudaStreamSynchronize(stream);

// Release stream
cudaStreamDestroy(stream);

const auto t_end = std::chrono::high_resolution_clock::now();

I wonder whether the code above can reduce the data transport time. Is this code correct?


You need to use pinned (page-locked) host memory, allocated with “cudaMallocHost”, instead of ordinary pageable host memory, and use it together with a stream to optimize the data transfer.
Also, scheduling requests in separate streams allows work to be scheduled immediately as the hardware becomes available without unnecessary synchronization.

Please refer to the link and sample below: