Copying from pinned memory to host is 3x slower than copying from CUDA device memory to host. Why?

My platform is TX2.

Method 1: I copied data from CUDA device memory to host memory using cudaMemcpy().
The device memory is allocated with cudaMalloc and the host memory with new. The copy takes about 10 ms.

Method 2: I then tried copying data from pinned memory to host memory using memcpy().
The pinned memory is allocated with cudaMallocHost and the host memory with new. The copy takes about 30 ms.
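
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two copies could be timed. It is not the original test code: the buffer size kBytes and the helper names (elapsed_ms, time_device_to_host, time_pinned_to_host) are assumptions for illustration, since the post does not give the data size.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <memory>

// Hypothetical buffer size; the original post does not state one.
constexpr size_t kBytes = 64u << 20;

using Clock = std::chrono::steady_clock;

static double elapsed_ms(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

// Method 1: cudaMalloc buffer -> pageable host buffer via cudaMemcpy().
// Returns the copy time in ms, or -1.0 on a CUDA error.
double time_device_to_host(size_t bytes) {
    void* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0;
    std::unique_ptr<char[]> host(new char[bytes]);
    std::memset(host.get(), 0, bytes);  // touch pages so page faults are not timed
    cudaMemset(dev, 1, bytes);
    cudaDeviceSynchronize();            // make sure the memset has finished

    auto t0 = Clock::now();
    cudaError_t err = cudaMemcpy(host.get(), dev, bytes, cudaMemcpyDeviceToHost);
    double ms = elapsed_ms(t0);

    cudaFree(dev);
    return err == cudaSuccess ? ms : -1.0;
}

// Method 2: cudaMallocHost (pinned) buffer -> pageable host buffer via memcpy().
// Returns the copy time in ms, or -1.0 on a CUDA error.
double time_pinned_to_host(size_t bytes) {
    void* pinned = nullptr;
    if (cudaMallocHost(&pinned, bytes) != cudaSuccess) return -1.0;
    std::unique_ptr<char[]> host(new char[bytes]);
    std::memset(host.get(), 0, bytes);  // touch pages so page faults are not timed
    std::memset(pinned, 1, bytes);

    auto t0 = Clock::now();
    std::memcpy(host.get(), pinned, bytes);  // plain CPU read of pinned memory
    double ms = elapsed_ms(t0);

    cudaFreeHost(pinned);
    return ms;
}

int main() {
    std::printf("cudaMemcpy device -> host: %.2f ms\n", time_device_to_host(kBytes));
    std::printf("memcpy     pinned -> host: %.2f ms\n", time_pinned_to_host(kBytes));
    return 0;
}
```

Build with nvcc and run each measurement a few times; the first call also pays for CUDA context creation outside the timed region, but the results can still vary from run to run.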

I am confused here. The GPU in TX2 doesn’t have its own memory, so all memory can be regarded as CPU memory, and method 2 should take at most 10 ms. If anything, method 1 should be the slower one, since it has to go GPU mapping -> pinned -> host, while method 2 only needs pinned -> host.


Please check our documentation on the Jetson memory system.

Since TX2 doesn’t support I/O coherency, pinned memory is not cached on the CPU. CPU accesses to pinned memory, such as the memcpy() in method 2, are therefore slow and can cause unpredictable latencies in the application.
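
One possible alternative, not stated in the reply above and offered only as a sketch, is to keep the data in unified (managed) memory, which stays cached on the CPU on Jetson, and read it with the CPU only after the GPU has finished. The kernel fill and the helper copy_from_managed below are hypothetical names for illustration; whether this is actually faster for a given workload has to be measured.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical kernel standing in for whatever produces the data on the GPU.
__global__ void fill(unsigned char* data, size_t n, unsigned char value) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] = value;
}

// Produces data in managed memory on the GPU, then copies it to a normal
// host buffer with memcpy(). Returns true if the copied data is as expected.
// Assumes bytes > 0.
bool copy_from_managed(size_t bytes) {
    unsigned char* managed = nullptr;
    if (cudaMallocManaged(&managed, bytes) != cudaSuccess) return false;

    const unsigned threads = 256;
    const unsigned blocks = static_cast<unsigned>((bytes + threads - 1) / threads);
    fill<<<blocks, threads>>>(managed, bytes, 0x5A);

    // The CPU must not touch managed memory while the GPU may still be
    // using it, so wait for the kernel before reading.
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    std::vector<unsigned char> host(bytes);
    if (ok) {
        std::memcpy(host.data(), managed, bytes);  // ordinary CPU read
        ok = host.front() == 0x5A && host.back() == 0x5A;
    }

    cudaFree(managed);
    return ok;
}

int main() {
    std::printf("managed copy %s\n", copy_from_managed(1u << 20) ? "ok" : "failed");
    return 0;
}
```

The trade-off is that the driver has to do cache maintenance around kernel launches and synchronization, so this tends to suit buffers the CPU reads many times rather than ones it touches once.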