TensorRT data copies cost a lot of time

Description

I converted a ResNet v1_50 model to TensorRT and run it in INT8 precision. With batch_size 16, inference alone costs 10.73 ms/batch; however, when I add copyInputToDevice() and copyOutputToHost(), it costs 14.88 ms/batch, while the TF-TRT model costs 13.08 ms/batch (data transfer included).
I also tried copyInputToDeviceAsync() and copyOutputToHostAsync() and ran the model with context->enqueue(), but the time does not go down.
Is there any way to reduce the data transfer time? Thank you very much!

Environment

TensorRT Version: 7.0
GPU Type: T4
Nvidia Driver Version: 410.79
CUDA Version: 10.0
CUDNN Version: 7.6.4
Operating System + Version: CentOS 7
Python Version (if applicable): 2.7
TensorFlow Version (if applicable): 1.15

Code

const auto t_start = std::chrono::high_resolution_clock::now();

// Create CUDA stream for the execution of this inference
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

// Asynchronously copy data from host input buffers to device input buffers
buffers.copyInputToDeviceAsync(stream);

// Asynchronously enqueue the inference work
if (!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(), stream, nullptr))
{
    return false;
}

// Asynchronously copy data from device output buffers to host output buffers
buffers.copyOutputToHostAsync(stream);

// Wait for the work in the stream to complete
cudaStreamSynchronize(stream);

// Release stream
cudaStreamDestroy(stream);

const auto t_end = std::chrono::high_resolution_clock::now();

I wonder whether the code above can reduce the data transfer time. Is this code correct?

Hi,

You need to use pinned (page-locked) host memory allocated with cudaMallocHost instead of regular pageable memory, and use it together with a CUDA stream to optimize the data transfer.
Also, scheduling requests in separate streams allows work to be scheduled immediately as the hardware becomes available without unnecessary synchronization.
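As a rough sketch (the names hostInput, hostOutput, deviceInput, deviceOutput, inputBytes, outputBytes, bindings, batchSize, and the CHECK macro are placeholders standing in for your own engine setup, not something from your code), allocating pinned host buffers with cudaMallocHost and issuing the copies with cudaMemcpyAsync on the same stream as the enqueue call could look like this:

#include <cuda_runtime.h>

// Allocate pinned (page-locked) host buffers; cudaMemcpyAsync can only
// overlap with other work when the host side is page-locked.
float* hostInput = nullptr;
float* hostOutput = nullptr;
CHECK(cudaMallocHost(&hostInput, inputBytes));
CHECK(cudaMallocHost(&hostOutput, outputBytes));

cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));

// ... fill hostInput with the preprocessed batch ...

// Host-to-device copy, inference, and device-to-host copy issued on one stream
CHECK(cudaMemcpyAsync(deviceInput, hostInput, inputBytes,
                      cudaMemcpyHostToDevice, stream));
context->enqueue(batchSize, bindings, stream, nullptr);
CHECK(cudaMemcpyAsync(hostOutput, deviceOutput, outputBytes,
                      cudaMemcpyDeviceToHost, stream));
CHECK(cudaStreamSynchronize(stream));

// Pinned memory must be released with cudaFreeHost, not free()/delete
CHECK(cudaFreeHost(hostInput));
CHECK(cudaFreeHost(hostOutput));
CHECK(cudaStreamDestroy(stream));

With this pattern you can also create the stream and pinned buffers once at startup and reuse them for every batch, so only the copies and the enqueue remain in the per-batch timing loop.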

Please refer to the documentation and sample below:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#optimize-performance
https://github.com/NVIDIA/TensorRT/blob/07ed9b57b1ff7c24664388e5564b17f7ce2873e5/samples/opensource/sampleMovieLensMPS/sampleMovieLensMPS.cpp

Thanks