Data transfer costs a lot of time

I use TensorRT 3.0 to accelerate my inference. However, moving image data into CUDA memory and moving the output from CUDA memory back to the CPU costs a lot of time. I follow the example in jetson-inference, using cudaAllocMapped((void**)&mInputCPU, (void**)&mInputCUDA, inputSize) and cudaAllocMapped((void**)&outputCPU, (void**)&outputCUDA, outputSize).
What is the right way to move data between host and device?

Hi,

It’s recommended to use unified memory:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd

Unified memory provides the same memory address to both the CPU and the GPU.
This frees the user from implementing memory copies manually.
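For reference, here is a minimal sketch of the unified memory pattern using cudaMallocManaged. The kernel, buffer size, and scale factor are hypothetical placeholders for illustration, not part of TensorRT or the jetson-inference example:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel standing in for the real inference work.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;        // placeholder buffer size
    float* buf = nullptr;

    // One allocation visible from both host and device;
    // no explicit cudaMemcpy between CPU and GPU is needed.
    cudaMallocManaged(&buf, n * sizeof(float));

    for (int i = 0; i < n; ++i)   // CPU writes the input in place
        buf[i] = 1.0f;

    // GPU reads and writes through the same pointer.
    scale<<<(n + 255) / 256, 256>>>(buf, n, 2.0f);
    cudaDeviceSynchronize();      // wait before the CPU touches the data again

    printf("buf[0] = %f\n", buf[0]);  // CPU reads the result directly
    cudaFree(buf);
    return 0;
}
```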

Thanks.