Description
I converted a ResNet v1_50 model to TensorRT and ran it in INT8 precision. With a batch size of 16, inference alone costs 10.73 ms/batch; however, when I add copyInputToDevice() and copyOutputToHost(), it costs 14.88 ms/batch, while the TF-TRT model costs 13.08 ms/batch (data transfer included).
I also tried copyInputToDeviceAsync() and copyOutputToHostAsync() and ran the model with context->enqueue(), but the time cost did not go down.
Is there any way to reduce the data transfer time? Thank you very much!
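One thing I am considering trying is page-locked (pinned) host buffers, since as far as I know cudaMemcpyAsync only truly overlaps with other work when the host memory is pinned, and pinned transfers are faster than pageable ones. Below is a rough sketch of what I mean; inputSize, outputSize, and the binding order are placeholders for my real model, not verified code.

#include <cuda_runtime.h>

// Rough sketch: pinned host buffers + async copies on one stream.
// inputSize / outputSize are placeholder byte counts for my bindings.
void* hostInput = nullptr;
void* hostOutput = nullptr;
void* deviceInput = nullptr;
void* deviceOutput = nullptr;

cudaHostAlloc(&hostInput, inputSize, cudaHostAllocDefault); // page-locked host memory
cudaHostAlloc(&hostOutput, outputSize, cudaHostAllocDefault);
cudaMalloc(&deviceInput, inputSize);
cudaMalloc(&deviceOutput, outputSize);

cudaStream_t stream;
cudaStreamCreate(&stream);

// ... fill hostInput with the preprocessed batch ...

cudaMemcpyAsync(deviceInput, hostInput, inputSize, cudaMemcpyHostToDevice, stream);
// context->enqueue(batchSize, bindings, stream, nullptr); // bindings = {deviceInput, deviceOutput}
cudaMemcpyAsync(hostOutput, deviceOutput, outputSize, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);

cudaFreeHost(hostInput);
cudaFreeHost(hostOutput);
cudaFree(deviceInput);
cudaFree(deviceOutput);
cudaStreamDestroy(stream);

Would this kind of setup be expected to help, or is the transfer time I am seeing already close to the PCIe limit for this batch size?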
Environment
TensorRT Version: 7.0
GPU Type: T4
Nvidia Driver Version: 410.79
CUDA Version: 10.0
CUDNN Version: 7.6.4
Operating System + Version: CentOS 7
Python Version (if applicable): 2.7
TensorFlow Version (if applicable): 1.15
Code
const auto t_start = std::chrono::high_resolution_clock::now();
// Create CUDA stream for the execution of this inference
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
// Asynchronously copy data from host input buffers to device input buffers
buffers.copyInputToDeviceAsync(stream);
// Asynchronously enqueue the inference work
if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
{
    return false;
}
// Asynchronously copy data from device output buffers to host output buffers
buffers.copyOutputToHostAsync(stream);
// Wait for the work in the stream to complete
cudaStreamSynchronize(stream);
// Release stream
cudaStreamDestroy(stream);
const auto t_end = std::chrono::high_resolution_clock::now();
I wonder whether the code above can reduce the data transfer time. Is this code correct?
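For comparison, here is how I am thinking of restructuring the measurement so that stream creation and destruction are not included in the per-batch time, and so that only the copies plus inference are timed with CUDA events. This is just my own sketch, reusing the same buffers, context, and CHECK macro as above; I have not confirmed it is the recommended way to time this.

// Create the stream once, outside the timed region
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

cudaEvent_t start, stop;
CHECK(cudaEventCreate(&start));
CHECK(cudaEventCreate(&stop));

// Time only H2D copy + inference + D2H copy
CHECK(cudaEventRecord(start, stream));
buffers.copyInputToDeviceAsync(stream);
if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
{
    return false;
}
buffers.copyOutputToHostAsync(stream);
CHECK(cudaEventRecord(stop, stream));
CHECK(cudaEventSynchronize(stop));

float ms = 0.0f;
CHECK(cudaEventElapsedTime(&ms, start, stop));

CHECK(cudaEventDestroy(start));
CHECK(cudaEventDestroy(stop));
CHECK(cudaStreamDestroy(stream));

My thinking is that creating and destroying the stream inside the timed region might add a little overhead to every measured batch, so moving it outside should give a cleaner number for the transfers themselves. Is that reasoning correct?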