When I use TensorRT 5.0 to run inference on an image, I measured the time of cudaMemcpyAsync. I also replaced it with cudaMemcpy, and the elapsed time is the same.
auto t_start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 10; i++)
    CHECK(cudaMemcpyAsync(m_gpu_buffers[m_next_ready_stream_index * m_nbBindings + m_inputIndex],
                          inputData,
                          m_batches * m_INPUT_C * m_INPUT_H * m_INPUT_W * sizeof(float),
                          cudaMemcpyHostToDevice,
                          m_vecstreams[m_next_ready_stream_index]));
auto t_Memend = std::chrono::high_resolution_clock::now();
printf("MemAcpy cost %.3f\n", std::chrono::duration<float, std::milli>(t_Memend - t_start).count());
No matter what, the loop takes about 1.17 ms, and if m_batches * m_INPUT_C * m_INPUT_H * m_INPUT_W becomes 6 times larger, the time also grows by about 6 times.
It looks like the copy does not overlap with the host thread. If it could overlap, cudaMemcpyAsync() should return immediately, so its measured cost should not change with the copy size, shouldn't it?
By the way, inputData is allocated with cudaMallocHost(), so it is pinned host memory.
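To show what I would expect, here is a minimal standalone sketch (the 16 MB buffer size and variable names are placeholders, not my real code) that separates the enqueue cost of cudaMemcpyAsync from the time until the transfer actually completes:

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 24;      // 16 MB placeholder for the real input size
    float *h_src = nullptr, *d_dst = nullptr;
    cudaMallocHost(&h_src, bytes);     // pinned host memory, as in my case
    cudaMalloc(&d_dst, bytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Time only the enqueue: with pinned memory I would expect this to be
    // tiny and independent of the copy size.
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10; ++i)
        cudaMemcpyAsync(d_dst, h_src, bytes, cudaMemcpyHostToDevice, stream);
    auto t1 = std::chrono::high_resolution_clock::now();

    // Time until the copies actually finish: this should scale with size.
    cudaStreamSynchronize(stream);
    auto t2 = std::chrono::high_resolution_clock::now();

    printf("enqueue: %.3f ms, enqueue + transfer: %.3f ms\n",
           std::chrono::duration<float, std::milli>(t1 - t0).count(),
           std::chrono::duration<float, std::milli>(t2 - t0).count());

    cudaStreamDestroy(stream);
    cudaFree(d_dst);
    cudaFreeHost(h_src);
    return 0;
}
```

In my real code the first number already behaves like the second one, i.e. the enqueue time alone scales with the copy size.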