cudaMemcpyAsync() takes the same time as cudaMemcpy()

When I use TensorRT 5.0 to run inference on an image, I measure the cudaMemcpyAsync() time, and if I replace it with cudaMemcpy(), the cost is the same.

auto t_start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 10; i++)
    CHECK(cudaMemcpyAsync(m_gpu_buffers[m_next_ready_stream_index * m_nbBindings + m_inputIndex],
                          inputData,
                          m_batches * m_INPUT_C * m_INPUT_H * m_INPUT_W * sizeof(float),
                          cudaMemcpyHostToDevice,
                          m_vecstreams[m_next_ready_stream_index]));

auto t_Memend = std::chrono::high_resolution_clock::now();
printf("   MemAsync copy cost %.3f ms\n", std::chrono::duration<float, std::milli>(t_Memend - t_start).count());

No matter what, it takes about 1.17 ms, and if m_batches * m_INPUT_C * m_INPUT_H * m_INPUT_W becomes 6 times larger, the time also grows by about 6x.
It looks like the copy does not overlap with the host.
If it could overlap, the cudaMemcpyAsync() call should return immediately and the measured time should not change, shouldn't it?

By the way, inputData is allocated with cudaMallocHost().

I found the mistake: I did not actually allocate inputData with cudaMallocHost() in my code. With pageable (non-pinned) host memory, cudaMemcpyAsync() cannot overlap with host execution and behaves like a synchronous copy, which is why the cost matched cudaMemcpy().
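For anyone hitting the same issue, here is a minimal standalone sketch (not the original TensorRT code; buffer size and names are made up) that times the same cudaMemcpyAsync() call from a pageable malloc() buffer versus a pinned cudaMallocHost() buffer. The "launch" time is how long the call takes to return to the host; only the pinned case should return almost immediately. It needs a CUDA-capable GPU and nvcc to run.

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Simple error-checking macro, similar in spirit to the CHECK() used above.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s\n",                    \
                    cudaGetErrorString(err));                      \
            exit(1);                                               \
        }                                                          \
    } while (0)

int main() {
    const size_t bytes = 64 * 1024 * 1024;  // 64 MB test buffer (arbitrary)
    float *d_buf = nullptr, *h_pageable = nullptr, *h_pinned = nullptr;
    CHECK(cudaMalloc(&d_buf, bytes));
    h_pageable = (float*)malloc(bytes);       // pageable: async copy degrades
    CHECK(cudaMallocHost(&h_pinned, bytes));  // pinned: copy can overlap

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    auto time_copy = [&](float* src, const char* label) {
        auto t0 = std::chrono::high_resolution_clock::now();
        CHECK(cudaMemcpyAsync(d_buf, src, bytes,
                              cudaMemcpyHostToDevice, stream));
        // Time until the call *returns*, not until the copy finishes.
        auto t1 = std::chrono::high_resolution_clock::now();
        CHECK(cudaStreamSynchronize(stream));
        auto t2 = std::chrono::high_resolution_clock::now();
        printf("%-8s launch %.3f ms, total %.3f ms\n", label,
               std::chrono::duration<float, std::milli>(t1 - t0).count(),
               std::chrono::duration<float, std::milli>(t2 - t0).count());
    };

    time_copy(h_pageable, "pageable");  // launch time close to total time
    time_copy(h_pinned,   "pinned");    // launch returns almost immediately

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFreeHost(h_pinned));
    free(h_pageable);
    CHECK(cudaFree(d_buf));
    return 0;
}
```

Note the cudaStreamSynchronize() before the second timestamp: if you only measure the async launch with a pinned buffer, you are measuring how fast the call enqueues, not how long the transfer takes.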