Data transfer takes a lot of time

I use TensorRT 3.0 to speed up inference. However, moving the image data into CUDA memory and moving the output from CUDA memory back to the CPU takes a lot of time. I followed the example in jetson-inference, using cudaAllocMapped((void**)&mInputCPU, (void**)&mInputCUDA, inputSize) and cudaAllocMapped((void**)&outputCPU, (void**)&outputCUDA, outputSize).
What is the right way to move data between host and device?


It’s recommended to use unified memory:

Unified memory gives the CPU and GPU the same memory address for a buffer.
This frees the user from implementing memory copies manually.
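
As a rough illustration of the idea, here is a minimal sketch using cudaMallocManaged from the CUDA runtime. The scale kernel, the buffer size, and the variable names are made up for this example and only stand in for the real inference step; it is not taken from jetson-inference or TensorRT. In a TensorRT pipeline the same managed pointers would presumably be passed as the input and output bindings, but that part is an assumption and is not shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the inference step: multiplies each element by a factor.
__global__ void scale(const float* in, float* out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;
}

int main()
{
    const int n = 1 << 20;  // illustrative buffer size, not from the original post
    float* input = nullptr;
    float* output = nullptr;

    // Managed allocations: one pointer per buffer, usable from both CPU and GPU.
    if (cudaMallocManaged(&input, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&output, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }

    // The CPU fills the input directly; no cudaMemcpy is needed.
    for (int i = 0; i < n; ++i)
        input[i] = 1.0f;

    // The GPU reads and writes through the same pointers.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(input, output, n, 2.0f);

    // Wait for the kernel to finish before the CPU touches managed memory again.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The CPU reads the result directly from the managed buffer.
    printf("output[0] = %f\n", output[0]);

    cudaFree(input);
    cudaFree(output);
    return 0;
}
```

This should build with something like nvcc example.cu -o example. The synchronization before reading the output matters: on many devices the CPU must not access managed memory while a kernel is still running.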