I was trying to run my model on a Jetson AGX Xavier devkit.
When I used cudaMemcpy and recorded the latency, I got these results:
copy data from GPU to CPU: 22 ms
inference: 28 ms
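For reference, this is roughly how I timed the explicit-copy path (a sketch; run_inference and count are placeholders for my actual engine call and output size):

#include <cuda_runtime.h>
#include <cstdlib>

extern void run_inference(float *d_out);  // placeholder for my engine call
extern const size_t count;                // placeholder for the output size

void time_copy_path() {
    float *d_out, *h_out;
    cudaMalloc((void **)&d_out, count * sizeof(float));
    h_out = (float *)malloc(count * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the inference itself (~28 ms).
    cudaEventRecord(start);
    run_inference(d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float infer_ms;
    cudaEventElapsedTime(&infer_ms, start, stop);

    // Time the device-to-host copy (~22 ms).
    cudaEventRecord(start);
    cudaMemcpy(h_out, d_out, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float copy_ms;
    cudaEventElapsedTime(&copy_ms, start, stop);
}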
Since the GPU and CPU share the same physical memory on Xavier, I switched to the zero-copy method (following this blog: http://arrayfire.com/zero-copy-on-tegra-k1 ):
void *cpu_data, *gpu_data;
cudaSetDeviceFlags(cudaDeviceMapHost);
cudaHostAlloc(&cpu_data, count * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer(&gpu_data, cpu_data, 0);
I used the code above to get a cpu_data pointer and a gpu_data pointer, loaded the feature data through cpu_data, and passed gpu_data directly to the GPU inference. The inference latency became 50 ms, which is exactly the total of the data copy plus the inference when using cudaMemcpy (22 ms + 28 ms).
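The usage pattern is roughly this (a sketch; load_features and run_inference are placeholders for my actual input loading and engine call):

load_features((float *)cpu_data, count);  // fill the mapped buffer on the CPU side
run_inference((float *)gpu_data);         // device alias of the same memory, no cudaMemcpy
cudaDeviceSynchronize();                  // wait before the CPU touches the buffer again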
The first inference took 28 ms, but from the second inference onward the latency increased to 50 ms.
Is the data copy avoidable? What did I do wrong?