Hi all,
I have been running my code on both a non-unified memory setup (CPU + discrete Hopper GPU) and an NVIDIA Grace Hopper system with Unified Memory. Surprisingly, I am observing better performance on the non-unified Hopper setup than on Grace Hopper.
In my current approach, I allocate all memory on the CPU (using malloc) and rely on Grace Hopper's Unified Memory to migrate data between the CPU and GPU on demand via page faults. However, I suspect this might not be the most efficient strategy. Would it be better to systematically allocate memory based on expected first access? That is:
- If the GPU is expected to access it first, allocate on the GPU.
- If the CPU is expected to access it first, allocate on the CPU.
Since my codebase is quite large, I opted for the simple strategy of always allocating on the CPU to avoid extensive modifications. However, this seems to be negatively impacting performance on Grace Hopper compared to a traditional CPU + Hopper GPU setup.
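To make the question concrete, the first-touch strategy I have in mind would look roughly like the sketch below (`initOnGpu` is just a placeholder kernel, not code from my actual application):

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void initOnGpu(float *a, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = 1.0f;
}

int main() {
    const size_t n = 1 << 20;

    // GPU-first buffer: allocate with cudaMalloc so it is resident in HBM
    // and the first kernel touch does not trigger page migration.
    float *gpu_first;
    cudaMalloc(&gpu_first, n * sizeof(float));
    initOnGpu<<<(unsigned)((n + 255) / 256), 256>>>(gpu_first, n);

    // CPU-first buffer: plain malloc keeps it in CPU (LPDDR) memory on the
    // Grace side; the GPU can still dereference it later over the
    // cache-coherent NVLink-C2C interconnect.
    float *cpu_first = static_cast<float *>(malloc(n * sizeof(float)));
    for (size_t i = 0; i < n; ++i) cpu_first[i] = 2.0f;

    cudaDeviceSynchronize();
    free(cpu_first);
    cudaFree(gpu_first);
    return 0;
}
```

Is this kind of per-buffer placement worth the refactoring effort, or does the driver's migration heuristic usually catch up after the first few accesses?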
Additionally, is malloc a better choice than cudaMallocManaged (or vice versa) on Grace Hopper for performance and memory management?
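For comparison, the managed-memory variant I would be weighing against plain malloc looks something like this (a minimal sketch; the prefetch/advise hints are optional extras I am unsure about, not something I currently use):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *buf;

    // Managed allocation: the driver may migrate pages between CPU and GPU.
    cudaMallocManaged(&buf, n * sizeof(float));

    // Optional hints: declare a preferred location and prefetch to the GPU
    // before the first kernel launch, instead of paying per-page fault
    // costs on first touch.
    int device = 0;
    cudaGetDevice(&device);
    cudaMemAdvise(buf, n * sizeof(float),
                  cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(buf, n * sizeof(float), device, 0);

    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```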
I would appreciate any advice or insights from the community!