Using ATS on GH200

As I understand it, the GH200 supports a memory model where a malloc() allocation can be accessed by either the CPU or the GPU. When I query the addressing mode with nvidia-smi, I see that it reports ATS, not HMM, on the following system.

$ nvidia-smi -q | grep Addressing
Addressing Mode : ATS

$ uname -r
5.14.0-362.24.1.el9_3.aarch64+64k

$ nvcc --version
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Based on this HMM blog post by NVIDIA, shouldn't HMM and ATS have similar performance on the GH200?

However, I notice significant performance degradation when I replace cudaMalloc() with malloc() while running LLM inference using llama.cpp.

What could be the underlying reason for this? Where can I find instructions on migrating applications that use cudaMalloc()/cudaMallocManaged() to ATS systems like the GH200? All the NVIDIA articles seem to talk about HMM, not ATS.

cudaMalloc locates its allocation in GPU device memory. On a GH200, this is the 80GB (or more) of HBM with a peak theoretical bandwidth of 4TB/s (or more). When GPU device code, such as LLM inference running on the GPU, accesses device memory, no translation service is required; neither HMM nor ATS is involved.

malloc locates its allocation in CPU memory (i.e. host memory). When GPU device code accesses such data, either HMM or ATS is required; otherwise, a pointer returned by the host allocator malloc is inaccessible from device code, and attempting to use it in device code anyway would most likely result in a machine fault. In the case of the GH200, the CPU's peak theoretical memory bandwidth is on the order of 500GB/s, and furthermore the use of ATS means the data flows over the NVLink-C2C link between CPU and GPU, which has a peak theoretical bandwidth of 900GB/s.
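To make the distinction concrete, here is a minimal sketch (my own, not taken from llama.cpp) of the malloc case: on a system with ATS or HMM, device code can dereference a plain malloc() pointer directly, while on a system with neither the same access would fault.

// sketch: device access to a system (malloc) allocation, valid only with ATS/HMM
#include <cstdio>
#include <cstdlib>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // direct access to host-allocated memory
}

int main() {
    const int n = 1 << 20;
    float *host_data = (float *)malloc(n * sizeof(float));  // system allocator, not cudaMalloc
    for (int i = 0; i < n; i++) host_data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(host_data, n);  // legal only on ATS/HMM systems
    cudaDeviceSynchronize();

    printf("%f\n", host_data[0]);  // expect 2.0
    free(host_data);
    return 0;
}

With cudaMalloc instead, the kernel would read HBM at full device bandwidth, but the pointer would have to be populated with an explicit cudaMemcpy first.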

So a 4x or greater slowdown in data-access speed could be witnessed when comparing these two cases: 4TB/s for HBM versus at best ~900GB/s over NVLink-C2C, or ~500GB/s if CPU memory is the bottleneck, i.e. roughly a 4x-8x gap. That could have a significant effect on observed application performance.

In general, “migrating” an application to GH200 is not likely to follow a recipe such as “replace every use of cudaMalloc with malloc”. In fact, to a first-order approximation, migrating an application that already runs in a GPU-accelerated way may require no particular instructions or refactoring with respect to the GPU activity. (There may be some “porting” needed if the application is designed for x86-64 and needs conversion to the ARM architecture used by the Grace CPU.)

An application already designed for GPU usage probably respects an 80GB limit for GPU device memory, since CUDA GPUs with more than that amount of memory only recently became available. Apart from the CPU considerations, it's not likely that the GPU code requires refactoring just to migrate to GH200. But the Grace CPU memory (~480GB), accessible at ~500GB/s, might become an interesting target for applications that shuttle data between CPU and GPU. That description is less likely to apply to LLM inference, but could apply to LLM training, or perhaps to other GPU-accelerated applications such as recommender-system training.

A possible starting point for additional information could be this tuning guide, which mostly focuses on considerations for the Grace CPU but also links to recommended resources for Hopper.

This seems incorrect, going by how ATS is defined in the whitepaper.

ATS, as described in the tuning guide, is designed to prevent exactly this sort of performance deterioration from frequent page faults, by migrating the physical backing memory. From a programmer's perspective, as I understand it, malloc() and cudaMalloc() should have similar performance on the GH200 because of ATS.

In this case, even with malloc(), the physical pages should reside in HBM, making the full memory bandwidth available if the GPU alone is accessing those regions.

Quoting from the whitepaper (also indicated in Figure 8):

System-allocated memory is migratable, i.e., the runtime can change its physical memory backing to improve application performance (Figure 9) or deal with memory pressure.

The differences between HMM and ATS are not very clear, since most of the programming guides only talk about HMM, not ATS, in the context of Grace Hopper.

llama.cpp supports GPU offloading on a per-layer basis (the -ngl/--n-gpu-layers option). In this mode, the inference does not have to run exclusively on the GPU.

Speaking for myself, even if we posit migration, I would not suppose that migrating data on first touch (the best imaginable case for malloc()) is as fast in every respect as non-migrated access via cudaMalloc(). That's not plausible to me. I do agree that once a page has migrated, access to data there is equivalent, performance-wise, to any other HBM access.

In a nutshell, ATS is a hardware mechanism for page mapping and translation, while HMM is a software mechanism that does the same thing. Grace has ATS built into the hardware (related to the NVLink-C2C bus), so it can take advantage of that. For an x86_64 system, HMM would be the only option. I don't know of a performance comparison between the two cases, but my general expectation is that ATS is faster (and may have more utility) than HMM.
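As I understand the CUDA device properties, one way to distinguish the two cases programmatically is to query pageableMemoryAccess (GPU can access malloc()'d memory at all) and pageableMemoryAccessUsesHostPageTables (GPU uses the CPU's page tables directly, which is the ATS case). My expectation is that the latter reads 1 on a GH200 and 0 on an HMM-only x86_64 system:

#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 1: the GPU can access system-allocated (malloc) memory, via HMM or ATS
    printf("pageableMemoryAccess: %d\n", prop.pageableMemoryAccess);
    // 1: the GPU reads the CPU's page tables directly, i.e. ATS;
    // 0 (with the above set) suggests the software HMM path instead
    printf("pageableMemoryAccessUsesHostPageTables: %d\n",
           prop.pageableMemoryAccessUsesHostPageTables);
    return 0;
}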

If my comments seem doubtful to you, another approach would be to use a profiler to understand the difference between the two cases. Since the only change you indicated is converting cudaMalloc to malloc, that would be the clue to investigate; presumably the slowdown relates somehow to the use of those allocations.

This blog post clarified it for me.

Even with very sophisticated driver prefetching heuristics, on-demand access with migration will never beat explicit bulk data copies or prefetches in terms of performance for large contiguous memory regions. This is the price for simplicity and ease of use. If the application's access pattern is well defined and structured you should prefetch using cudaMemPrefetchAsync.
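A minimal sketch of that advice as I read it (my own example, not from the blog post), assuming a system-allocated buffer and device 0:

#include <cstdlib>

__global__ void touch(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *buf = (float *)malloc(n * sizeof(float));  // system allocation
    for (int i = 0; i < n; i++) buf[i] = 0.0f;        // first touch on the CPU

    // Move the pages to GPU 0 in bulk before the kernel runs, instead of
    // relying on on-demand fault-and-migrate; expected to work only on
    // systems with full CUDA unified memory support (HMM or ATS).
    cudaMemPrefetchAsync(buf, n * sizeof(float), 0, 0 /* default stream */);
    touch<<<(n + 255) / 256, 256>>>(buf, n);
    cudaDeviceSynchronize();
    free(buf);
    return 0;
}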
