Does zero-copy still copy data?

I was trying to run my model on a Jetson AGX Xavier devkit.

When I was using cudaMalloc and cudaMemcpy and recording the latency, I got this result:

copy data from GPU to CPU: 22 ms
inference: 28 ms
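
For context, the measurement looks roughly like this (a minimal sketch; runInference stands in for my actual TensorRT call, and count is just a placeholder for the output element count):

    // Baseline: device output buffer + explicit copy back to pageable host memory
    float *d_out = nullptr;
    float *h_out = (float *)malloc(count * sizeof(float));
    cudaMalloc(&d_out, count * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    runInference(d_out);                        // the 28 ms inference step (placeholder)

    cudaEventRecord(start);
    cudaMemcpy(h_out, d_out, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float copyMs = 0.f;
    cudaEventElapsedTime(&copyMs, start, stop); // the 22 ms copy step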

As is known, the GPU and CPU share the same physical memory on Xavier, so I switched to the zero-copy method (following this blog: http://arrayfire.com/zero-copy-on-tegra-k1 ).

    void *cpu_data, *gpu_data;
    cudaSetDeviceFlags(cudaDeviceMapHost);                                  // enable mapped pinned memory
    cudaHostAlloc(&cpu_data, count * sizeof(float), cudaHostAllocMapped);   // pinned, CPU-accessible buffer
    cudaHostGetDevicePointer(&gpu_data, cpu_data, 0);                       // device alias of the same memory

I used the code above to get a cpu_data pointer and a gpu_data pointer, loaded the feature data into cpu_data, and used gpu_data directly for GPU inference. The inference latency became 50 ms, which is exactly the total latency of data copy plus inference when using cudaMalloc and cudaMemcpy.

The first inference took 28 ms, but from the second inference onward it increased to 50 ms.

Is the data copy avoidable? What went wrong?

It seems the inference step (I used enqueue and cudaStreamSynchronize) synchronized the data. If so, how should I synchronize the inference result back to the CPU? Do I need something like cudaStreamSynchronize, and how can I measure the latency of this sync? My output is twice the size of my input. Does that mean I need 44 ms to copy the result back to CPU memory? My God!!!
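
For reference, this is roughly what the zero-copy path looks like on my side (a minimal sketch; context->enqueue, batchSize, loadFeatures, and count are placeholders for my actual TensorRT setup):

    // Zero-copy path: input and output buffers are mapped pinned memory
    void *in_cpu, *in_gpu, *out_cpu, *out_gpu;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&in_cpu, count * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&out_cpu, 2 * count * sizeof(float), cudaHostAllocMapped); // output is 2x the input
    cudaHostGetDevicePointer(&in_gpu, in_cpu, 0);
    cudaHostGetDevicePointer(&out_gpu, out_cpu, 0);

    void *bindings[] = { in_gpu, out_gpu };

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    loadFeatures(in_cpu);                                    // CPU writes the input directly (placeholder)
    cudaEventRecord(start, stream);
    context->enqueue(batchSize, bindings, stream, nullptr);  // TensorRT inference
    cudaEventRecord(stop, stream);
    cudaStreamSynchronize(stream);                           // after this, out_cpu holds the result

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);                  // this is where I see 50 ms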

Pinned memory is not cached. You may try unified memory instead. See this post for an example.

Thanks a lot, I will try that. But I don’t understand why we need a cache here. I was processing different examples, and each example was processed only once. Or do you mean the CPU cache, which improves memory read/write speed?

Hi,

Pinned memory can be shared between the CPU and GPU, but the performance is not always fast.

With zero-copy, the physical memory is allocated and pinned in CPU system memory.
So a program may have fast or slow access to it depending on where it is being accessed from.

It’s recommended to use unified memory instead.
The CUDA driver can automatically handle the synchronization and pick a better location for you.
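
For example, the allocation could look roughly like this with unified memory (a minimal sketch; fillInput, myKernel, blocks, and threads are placeholders, not from your code):

    // Unified memory: one pointer, valid on both CPU and GPU
    float *data;
    cudaMallocManaged(&data, count * sizeof(float));

    fillInput(data);                          // CPU writes the input directly (placeholder)

    myKernel<<<blocks, threads>>>(data);      // GPU reads/writes the same pointer (placeholder)
    cudaDeviceSynchronize();                  // make the result visible to the CPU again

    cudaFree(data);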

Here is our document on the Jetson memory system for your reference:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

Thanks.

In the ‘Porting the code on Tegra’ example code here:

Why is the NULL stream used, rather than some non-default stream?

    // Porting the code on Tegra
    int main()
    {
        int *h_a, *d_b, *d_c, *h_d;
        int height = 1024;
        int width = 1024;
        size_t sizeOfImage = width * height * sizeof(int); // 4MB image

        // Unified memory allocated for input and output
        // buffers of the application pipeline
        cudaMallocManaged(&h_a, sizeOfImage, cudaMemAttachHost);
        cudaMallocManaged(&h_d, sizeOfImage);

        // Intermediate buffers are not needed on the CPU side,
        // so allocate them in device memory
        cudaMalloc(&d_b, sizeOfImage);
        cudaMalloc(&d_c, sizeOfImage);

        // CPU reads the image
        readImage(h_a); // Initialize the h_a buffer
        // ----- CUDA Application pipeline start ----
        // Prefetch input image data to GPU
        cudaStreamAttachMemAsync(NULL, h_a, 0, cudaMemAttachGlobal);
        k1<<<..>>>(h_a, d_b);
        k2<<<..>>>(d_b, d_c);
        k3<<<..>>>(d_c, h_d);
        // Prefetch output image data to CPU
        cudaStreamAttachMemAsync(NULL, h_d, 0, cudaMemAttachHost);
        cudaStreamSynchronize(NULL);
        // ----- CUDA Application pipeline end ----

        // Use processed image, i.e. h_d, on the CPU side
        UseImageonCPU(h_d);
    }
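
For comparison, here is how I imagine the same pipeline would look with a non-default stream (my own sketch, not from the document; the created stream and the stream arguments are my additions):

    // Sketch: same pipeline, but using an explicitly created stream
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Prefetch input image data to GPU on this stream
    cudaStreamAttachMemAsync(stream, h_a, 0, cudaMemAttachGlobal);
    k1<<<..., 0, stream>>>(h_a, d_b);
    k2<<<..., 0, stream>>>(d_b, d_c);
    k3<<<..., 0, stream>>>(d_c, h_d);
    // Prefetch output image data to CPU on this stream
    cudaStreamAttachMemAsync(stream, h_d, 0, cudaMemAttachHost);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);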

Hi hazelnutvt04,

Please open a new topic for this issue. Thanks