I am conducting AI training computations on a Jetson Orin NX device and have encountered an issue.
I changed several memory allocations from cudaMalloc to cudaHostAlloc (to use zero-copy) and observed a slight overhead even when the computation runs exclusively on the GPU. This overhead does not appear when using cudaMalloc.
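The change is essentially of the following form (a simplified sketch with a placeholder kernel and buffer size, not my actual training code):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real training computation.
__global__ void scale(float* data, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= a;
}

int main() {
    const int n = 1 << 20;

    // Before: device-only buffer, filled with an explicit copy.
    // float* d_buf; cudaMalloc(&d_buf, n * sizeof(float));
    // cudaMemcpy(d_buf, h_src, n * sizeof(float), cudaMemcpyHostToDevice);

    // After: zero-copy, page-locked host buffer mapped into the GPU.
    float* h_buf = nullptr;
    cudaHostAlloc((void**)&h_buf, n * sizeof(float), cudaHostAllocMapped);

    float* d_buf = nullptr;
    cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);  // GPU-visible alias of h_buf

    scale<<<(n + 255) / 256, 256>>>(d_buf, n, 2.0f);      // no cudaMemcpy needed
    cudaDeviceSynchronize();

    cudaFreeHost(h_buf);
    return 0;
}
```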
While researching this issue, I found this site, which states that, starting from CUDA 9.X and Xavier, cache coherence between the CPU and GPU is handled at the CPU cache level, eliminating the previous overheads.
Could you help me understand why the overhead still occurs when using cudaHostAlloc in my experiments? Is this related to the process of maintaining cache coherence at the CPU cache level?
Additionally, could you explain what happens internally when a zero-copy access occurs?
Pinned memory (cudaHostAlloc) allocates a page-locked buffer on the CPU.
Since the memory won't be paged out, the pointer can be shared with the GPU directly, without copying the data.
However, compared to cudaMalloc, which allocates memory for the GPU, access to this CPU memory is expected to be slower
(but only slightly slower, as Jetson shares the same physical memory between the CPU and GPU).
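If you want to see the difference yourself, a minimal sketch is to time the same kernel on a cudaMalloc buffer and on a mapped cudaHostAlloc buffer (the kernel and buffer size below are just placeholders):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Time one launch of the kernel on the given GPU-visible buffer.
static float timeKernel(float* d_ptr, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    touch<<<(n + 255) / 256, 256>>>(d_ptr, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 24;

    // Case 1: cudaMalloc (device allocation).
    float* d_dev = nullptr;
    cudaMalloc((void**)&d_dev, n * sizeof(float));
    printf("cudaMalloc    : %.3f ms\n", timeKernel(d_dev, n));

    // Case 2: cudaHostAlloc (page-locked, mapped zero-copy allocation).
    float* h_pinned = nullptr;
    cudaHostAlloc((void**)&h_pinned, n * sizeof(float), cudaHostAllocMapped);
    float* d_pinned = nullptr;
    cudaHostGetDevicePointer((void**)&d_pinned, h_pinned, 0);
    printf("cudaHostAlloc : %.3f ms\n", timeKernel(d_pinned, n));

    cudaFree(d_dev);
    cudaFreeHost(h_pinned);
    return 0;
}
```

On Jetson the gap should be small, since both buffers live in the same physical DRAM.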
Thank you so much for your response!
May I ask an additional question?
I would like to understand why accessing page-locked memory on the CPU is slightly slower than accessing memory allocated on the device, even though Jetson shares the same physical memory between the CPU and GPU.
I have checked the documentation, but I couldn’t find a precise explanation due to my lack of understanding.
This is more related to buffer addresses and handles, but we are not able to disclose too much here.
Jetson's CPU and GPU have different address spaces, although they share the same physical memory.
So when you allocate a buffer on the CPU, it takes some "overhead" to get the corresponding GPU address.
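At the API level, the only part of this that is visible is the extra mapping step, for example (just an illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Page-locked host allocation, mapped so the GPU can access it.
    float* h_ptr = nullptr;
    cudaHostAlloc((void**)&h_ptr, 1 << 20, cudaHostAllocMapped);

    // Ask the driver for the GPU-visible address of the same physical pages.
    // This mapping/lookup is where the extra bookkeeping comes in.
    float* d_ptr = nullptr;
    cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);

    // With unified virtual addressing the two printed values are often identical,
    // but the GPU still reaches the pages through its own mapping, i.e. the
    // separate address space mentioned above.
    printf("host pointer   : %p\n", (void*)h_ptr);
    printf("device pointer : %p\n", (void*)d_ptr);

    cudaFreeHost(h_ptr);
    return 0;
}
```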