My platform is the Jetson TX2.
Method 1: I copied data from CUDA device memory to host memory using cudaMemcpy().
The device memory is allocated with cudaMalloc and the host memory with new. The copy takes about 10 ms.
Method 2: I then tried copying data from pinned memory to host memory using memcpy().
The pinned memory is allocated with cudaMallocHost and the host memory with new. This copy takes about 30 ms.
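
A minimal sketch of the two measurements described above, assuming a 16 MB buffer. The size `kBytes` and the `elapsed_ms` helper are placeholders for illustration, not the exact code that produced the 10 ms and 30 ms figures, so absolute timings will differ.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstring>

// Hypothetical buffer size; the real size is an assumption.
static const size_t kBytes = 16 * 1024 * 1024;

// Hypothetical helper: milliseconds elapsed since t0.
static double elapsed_ms(std::chrono::steady_clock::time_point t0) {
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // Destination buffer in ordinary pageable host memory.
    unsigned char* host = new unsigned char[kBytes];
    std::memset(host, 0, kBytes);  // touch the pages so page faults are not timed

    // Method 1: cudaMalloc + cudaMemcpy (device -> host).
    void* dev = nullptr;
    if (cudaMalloc(&dev, kBytes) != cudaSuccess) return 1;
    cudaMemset(dev, 1, kBytes);
    cudaDeviceSynchronize();  // make sure the memset finished before timing
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(host, dev, kBytes, cudaMemcpyDeviceToHost);
    std::printf("method 1 (cudaMemcpy): %.2f ms\n", elapsed_ms(t0));

    // Method 2: cudaMallocHost + plain memcpy (pinned -> host).
    void* pinned = nullptr;
    if (cudaMallocHost(&pinned, kBytes) != cudaSuccess) return 1;
    std::memset(pinned, 2, kBytes);
    t0 = std::chrono::steady_clock::now();
    std::memcpy(host, pinned, kBytes);
    std::printf("method 2 (memcpy):     %.2f ms\n", elapsed_ms(t0));

    cudaFreeHost(pinned);
    cudaFree(dev);
    delete[] host;
    return 0;
}
```

Built with something like `nvcc -O2 bench.cu -o bench` (the file name is arbitrary), this prints one timing per method.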
I am confused here. The GPU in the TX2 doesn’t have its own memory, so all memory can be regarded as CPU memory, and method 2 should take at most 10 ms. If anything, method 1 has to do more work (GPU mapping -> pinned -> host), whereas method 2 only needs pinned -> host.