Does zero-copy still copy data?

I was trying to run my model on a Jetson AGX Xavier devkit.

When I was using cudaMalloc and cudaMemcpy and recording the latency, I got this result:

copy data from GPU to CPU: 22 ms
inference: 28 ms
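
For context, the measurement looks roughly like this (a minimal sketch; runInference stands in for my actual TensorRT call, and count is just a placeholder for the output element count):

    // Baseline: device output buffer + explicit copy back to pageable host memory
    float *d_out = nullptr;
    float *h_out = (float *)malloc(count * sizeof(float));
    cudaMalloc(&d_out, count * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    runInference(d_out);                        // the 28 ms inference step (placeholder)

    cudaEventRecord(start);
    cudaMemcpy(h_out, d_out, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float copyMs = 0.f;
    cudaEventElapsedTime(&copyMs, start, stop); // the 22 ms copy step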

As is known, the GPU and CPU share the same physical memory on Xavier, so I switched to the zero-copy method (following this blog: http://arrayfire.com/zero-copy-on-tegra-k1 ).

    void *cpu_data, *gpu_data;
    cudaSetDeviceFlags(cudaDeviceMapHost);                                  // enable mapped pinned memory
    cudaHostAlloc(&cpu_data, count * sizeof(float), cudaHostAllocMapped);   // pinned, CPU-accessible buffer
    cudaHostGetDevicePointer(&gpu_data, cpu_data, 0);                       // device alias of the same memory

I used the code above to get a cpu_data pointer and a gpu_data pointer, loaded the feature data into cpu_data, and used gpu_data directly for GPU inference. The inference latency became 50 ms, which is exactly the total latency of data copy plus inference when using cudaMalloc and cudaMemcpy.

The first inference took 28 ms, but from the second inference onward it increased to 50 ms.

Is the data copy avoidable? What went wrong?

It seems the inference step (I used enqueue and cudaStreamSynchronize) synchronized the data. If so, how should I synchronize the inference result back to the CPU? Do I need something like cudaStreamSynchronize, and how can I measure the latency of this sync? My output is twice the size of my input. Does that mean I need 44 ms to copy the result back to CPU memory? My God!!!
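
For reference, this is roughly what the zero-copy path looks like on my side (a minimal sketch; context->enqueue, batchSize, loadFeatures, and count are placeholders for my actual TensorRT setup):

    // Zero-copy path: input and output buffers are mapped pinned memory
    void *in_cpu, *in_gpu, *out_cpu, *out_gpu;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&in_cpu, count * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&out_cpu, 2 * count * sizeof(float), cudaHostAllocMapped); // output is 2x the input
    cudaHostGetDevicePointer(&in_gpu, in_cpu, 0);
    cudaHostGetDevicePointer(&out_gpu, out_cpu, 0);

    void *bindings[] = { in_gpu, out_gpu };

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    loadFeatures(in_cpu);                                    // CPU writes the input directly (placeholder)
    cudaEventRecord(start, stream);
    context->enqueue(batchSize, bindings, stream, nullptr);  // TensorRT inference
    cudaEventRecord(stop, stream);
    cudaStreamSynchronize(stream);                           // after this, out_cpu holds the result

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);                  // this is where I see 50 ms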

Pinned memory is not cached. You may try unified memory instead. See this post for an example.

Thanks a lot, I will try that. But I don’t understand why we need a cache here. I was processing different examples, and each example was processed only once. Or do you mean the CPU cache, which improves memory read/write speed?

Hi,

Pinned memory can be shared between the CPU and GPU, but the performance is not always fast.

With zero-copy, the physical memory is allocated and pinned in CPU system memory.
So a program may have fast or slow access to it depending on where it is being accessed from.

It’s recommended to use unified memory instead.
The CUDA driver can automatically handle the synchronization and pick a better location for you.
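
For example, the allocation could look roughly like this with unified memory (a minimal sketch; fillInput, myKernel, blocks, and threads are placeholders, not from your code):

    // Unified memory: one pointer, valid on both CPU and GPU
    float *data;
    cudaMallocManaged(&data, count * sizeof(float));

    fillInput(data);                          // CPU writes the input directly (placeholder)

    myKernel<<<blocks, threads>>>(data);      // GPU reads/writes the same pointer (placeholder)
    cudaDeviceSynchronize();                  // make the result visible to the CPU again

    cudaFree(data);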

Here is our document on the Jetson memory system for your reference:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

Thanks.

In the ‘Porting the code on Tegra’ example code here:

Why is the NULL stream used, rather than some non-default stream?

    // Porting the code on Tegra
    int main()
    {
        int *h_a, *d_b, *d_c, *h_d;
        int height = 1024;
        int width = 1024;
        size_t sizeOfImage = width * height * sizeof(int); // 4MB image

        // Unified memory allocated for input and output
        // buffers of the application pipeline
        cudaMallocManaged(&h_a, sizeOfImage, cudaMemAttachHost);
        cudaMallocManaged(&h_d, sizeOfImage);

        // Intermediate buffers are not needed on the CPU side,
        // so allocate them in device memory
        cudaMalloc(&d_b, sizeOfImage);
        cudaMalloc(&d_c, sizeOfImage);

        // CPU reads the image
        readImage(h_a); // Initialize the h_a buffer
        // ----- CUDA Application pipeline start ----
        // Prefetch input image data to GPU
        cudaStreamAttachMemAsync(NULL, h_a, 0, cudaMemAttachGlobal);
        k1<<<..>>>(h_a, d_b);
        k2<<<..>>>(d_b, d_c);
        k3<<<..>>>(d_c, h_d);
        // Prefetch output image data to CPU
        cudaStreamAttachMemAsync(NULL, h_d, 0, cudaMemAttachHost);
        cudaStreamSynchronize(NULL);
        // ----- CUDA Application pipeline end ----

        // Use processed image, i.e. h_d, on the CPU side
        UseImageonCPU(h_d);
    }
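
For comparison, here is how I imagine the same pipeline would look with a non-default stream (my own sketch, not from the document; the created stream and the stream arguments are my additions):

    // Sketch: same pipeline, but using an explicitly created stream
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Prefetch input image data to GPU on this stream
    cudaStreamAttachMemAsync(stream, h_a, 0, cudaMemAttachGlobal);
    k1<<<..., 0, stream>>>(h_a, d_b);
    k2<<<..., 0, stream>>>(d_b, d_c);
    k3<<<..., 0, stream>>>(d_c, h_d);
    // Prefetch output image data to CPU on this stream
    cudaStreamAttachMemAsync(stream, h_d, 0, cudaMemAttachHost);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);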

Hi hazelnutvt04,

Please open a new topic for this issue. Thanks