Copy from pinned memory to host is 3x slower than copy from CUDA device memory to host. Why?

heyworld · October 18, 2018, 6:52am

My platform is TX2.

Method1
I copied data from CUDA device memory to host memory with cudaMemcpy().
The device buffer is allocated with cudaMalloc and the host buffer with new. The copy takes about 10 ms.
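
For readers who want to reproduce the measurement, Method 1 could be timed roughly as below. This is only a sketch, not the original poster's code: the 64 MiB buffer size, the function name time_device_to_host, and the file name in the build note are assumptions, since the post gives none of them.

```cuda
#include <chrono>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Assumed buffer size; the original post does not state one.
constexpr size_t kBytes = 64u << 20;  // 64 MiB

// Times one blocking cudaMemcpy from a cudaMalloc buffer to a pageable
// host buffer allocated with new. Returns milliseconds, or -1.0 on error.
double time_device_to_host(size_t bytes) {
    void* d_buf = nullptr;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) return -1.0;
    cudaMemset(d_buf, 1, bytes);
    cudaDeviceSynchronize();

    char* h_buf = new char[bytes];
    std::memset(h_buf, 0, bytes);  // touch the pages so page faults are not timed

    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::steady_clock::now();

    delete[] h_buf;
    cudaFree(d_buf);
    if (err != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    time_device_to_host(kBytes);  // warm-up run, result discarded
    std::printf("cudaMemcpy device->host: %.2f ms\n", time_device_to_host(kBytes));
    return 0;
}
```

It can be built with something like nvcc -O2 method1.cu -o method1. The absolute number depends on the buffer size, so it will not necessarily match the 10 ms quoted above.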

Method2
Then I tried copying data from pinned memory to host memory with memcpy().
The pinned buffer is allocated with cudaMallocHost and the host buffer with new. The copy takes about 30 ms.
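
Method 2 could be timed in a similar way. Again this is a sketch under assumptions rather than the original code: the 64 MiB size and the name time_pinned_to_host are made up for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Assumed buffer size; the original post does not state one.
constexpr size_t kBytes = 64u << 20;  // 64 MiB

// Times one plain memcpy from a cudaMallocHost (pinned) buffer to a pageable
// host buffer allocated with new. Returns milliseconds, or -1.0 on error.
double time_pinned_to_host(size_t bytes) {
    void* pinned = nullptr;
    if (cudaMallocHost(&pinned, bytes) != cudaSuccess) return -1.0;
    std::memset(pinned, 1, bytes);

    char* h_buf = new char[bytes];
    std::memset(h_buf, 0, bytes);  // touch the pages so page faults are not timed

    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(h_buf, pinned, bytes);  // the CPU reads the pinned buffer directly
    auto t1 = std::chrono::steady_clock::now();

    // Read one byte back so the compiler cannot drop the copy.
    volatile char sink = h_buf[bytes - 1];
    (void)sink;

    delete[] h_buf;
    cudaFreeHost(pinned);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    time_pinned_to_host(kBytes);  // warm-up run, result discarded
    std::printf("memcpy pinned->host: %.2f ms\n", time_pinned_to_host(kBytes));
    return 0;
}
```

The timed region contains no CUDA call at all, so whatever it measures is the cost of the CPU reading the pinned buffer.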

This confuses me. The GPU in TX2 doesn't have its own memory, so all memory can be regarded as CPU memory, and I would expect method 2 to take at most 10 ms. On top of that, method 1 has to go GPU mapping -> pinned -> host, while method 2 only needs pinned -> host.

AastaLLL · October 18, 2018, 9:12am

Hi,

You can check our documentation on the Jetson memory system:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

TX2 doesn't support I/O coherency, so pinned memory is not cached on the CPU. CPU accesses to pinned memory can therefore be slow and add unpredictable latency to the application.
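
One alternative worth measuring is unified memory, which the linked app note lists as cached on both the CPU and the iGPU. The sketch below is only an illustration under assumptions: the fill kernel, the 64 MiB size, and the name time_managed_to_host are hypothetical, and nothing in this thread measures whether it is actually faster on TX2.

```cuda
#include <chrono>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Assumed buffer size; not taken from the thread.
constexpr size_t kBytes = 64u << 20;  // 64 MiB

// Hypothetical kernel that stands in for whatever produces the data on the GPU.
__global__ void fill(unsigned char* p, size_t n, unsigned char v) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) p[i] = v;
}

// Times a CPU memcpy out of a cudaMallocManaged buffer after the GPU has
// written it. Returns milliseconds, or -1.0 on error.
double time_managed_to_host(size_t bytes) {
    unsigned char* managed = nullptr;
    if (cudaMallocManaged(&managed, bytes) != cudaSuccess) return -1.0;

    const unsigned int threads = 256;
    const unsigned int blocks =
        static_cast<unsigned int>((bytes + threads - 1) / threads);
    fill<<<blocks, threads>>>(managed, bytes, 7);

    // The CPU must not touch managed memory while the GPU may still be using it.
    if (cudaGetLastError() != cudaSuccess ||
        cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(managed);
        return -1.0;
    }

    unsigned char* h_buf = new unsigned char[bytes];
    std::memset(h_buf, 0, bytes);  // touch the pages so page faults are not timed

    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(h_buf, managed, bytes);
    auto t1 = std::chrono::steady_clock::now();

    const bool ok = (h_buf[0] == 7 && h_buf[bytes - 1] == 7);
    delete[] h_buf;
    cudaFree(managed);
    if (!ok) return -1.0;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    time_managed_to_host(kBytes);  // warm-up run, result discarded
    std::printf("memcpy managed->host: %.2f ms\n", time_managed_to_host(kBytes));
    return 0;
}
```

If the CPU can work on the managed buffer in place, the copy into a separate host buffer may not be needed at all.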

Thanks.
