Performance decrease on unified-memory Grace Hopper

Hi all,

I have been running my code on both a non-unified-memory setup (discrete CPU + Hopper GPU) and an NVIDIA Grace Hopper system with Unified Memory. However, I am observing better performance on the discrete Hopper setup than on Grace Hopper.

In my current approach, I allocate all memory on the CPU (using malloc) and rely on Grace Hopper’s Unified Memory to migrate data between the CPU and GPU on-demand via page faults. However, I suspect this might not be the most efficient strategy. Would it be better to systematically allocate memory based on expected first access? That is:

  • If the GPU is expected to access it first, allocate on the GPU.
  • If the CPU is expected to access it first, allocate on the CPU.

Since my codebase is quite large, I opted for the simple strategy of always allocating on the CPU to avoid extensive modifications. However, this seems to be negatively impacting performance on Grace Hopper compared to a traditional CPU + Hopper GPU setup.
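For illustration, the first-touch strategy described above could look something like the following sketch (the buffer names, sizes, and access pattern are made up for the example):

```cuda
// Illustrative sketch of choosing the allocator by expected first toucher
// on a Grace Hopper system.

#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    size_t n = 1 << 24;

    // CPU writes this buffer first (e.g. parsing an input file).
    // A plain malloc places the pages in CPU memory; with ATS the
    // GPU can still access them later, either migrating on demand
    // or via an explicit prefetch.
    float *input = static_cast<float *>(malloc(n * sizeof(float)));

    // GPU writes this buffer first (e.g. kernel output). cudaMalloc
    // places it directly in GPU HBM, so the kernel never faults on it.
    // Note: cudaMalloc memory is not directly dereferenceable from the
    // CPU; use cudaMallocManaged instead if the CPU must read it in place.
    float *output = nullptr;
    cudaMalloc(&output, n * sizeof(float));

    // ... fill input on the CPU, launch kernels, copy results back ...

    cudaFree(output);
    free(input);
    return 0;
}
```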

Additionally, on Grace Hopper, is malloc a better choice than cudaMallocManaged (or vice versa) for performance and memory management?

I would appreciate any advice or insights from the community!

This may be of interest.

Try using cudaMemPrefetchAsync where you would normally use cudaMemcpy.

Otherwise, if you just rely on page faults, the pages will be migrated one at a time, which is far slower than a bulk transfer.

Remember that on Grace Hopper the pages still need to be migrated over NVLink-C2C; it’s faster than PCIe, but not zero cost.
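As a concrete example of the prefetch approach, here is a minimal sketch (the kernel and sizes are made up) that bulk-migrates a managed buffer to the GPU before the kernel launch instead of paying a fault per page inside it:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    size_t n = 1 << 24;
    size_t bytes = n * sizeof(float);

    float *data = nullptr;
    cudaMallocManaged(&data, bytes);

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // first touch on CPU

    int device = 0;
    cudaGetDevice(&device);

    // Bulk-migrate the pages to the GPU up front, where you would
    // otherwise have used cudaMemcpy in a non-managed code path.
    cudaMemPrefetchAsync(data, bytes, device, 0);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n);

    // Bring the data back to the CPU before reading it there.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```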

@tonywu93 is right!

This blog post explains it further:

Even with very sophisticated driver prefetching heuristics, on-demand access with migration will never beat explicit bulk data copies or prefetches in terms of performance for large contiguous memory regions. This is the price for simplicity and ease of use. If the application’s access pattern is well defined and structured, you should prefetch using cudaMemPrefetchAsync.

The GH200 whitepaper, however, does not make the point @tonywu93 accurately stated above: it attributes the performance gain to ATS, when it should actually be attributed to migration over NVLink instead.

The performance impact is projected in Figure 18 and highlights how applications transparently benefit from Grace Hopper features like ATS without any application-side changes.