vLLM v0.8.4 shows UVM GPU1 BH process with high utilization

Hi,

I have a application mode that uses UVA implementation of vLLM for CPU offloading on a GH200. When I run this, I see an unexplained process called UVM GPU1 BH, and also I do not see any profiled information about page faults using nsys profile --cpu-core-metrics=0,2,14 --gpu-metrics-device=all --cuda-um-cpu-page-faults=true --cuda-um-gpu-page-faults=true --event-sample=system-wide

Nsys rep attachment

I’m not sure what the “UVM GPU1 BH” process is referring to since it shows high utilization. Has anyone encountered this?

Thanks!

The implementation doesn’t appear to be making use of managed memory (where page faults might occur):

I try to re-implement CPU offloading in a fully transparent way: we offload the tensor to CPU, and let GPU directly view it as GPU tensor. It depends on UVA technology (no clear documentation, but there’re some public discussions), and per my discussion with nvidia experts, it works for systems with pinned memory.

I don’t have any info on the UVM GPU1 BH process, but it doesn’t appear to be unique to anything you’ve mentioned.

  1. I do see cudaHostAlloc() calls and an explicit mention that the implementation uses UVA. Does this mean the data is being access by device directly from the host CPU at a cacheline granularity over the NVLINK C2C on the GH200?

  2. Is it safe to say that any allocation with cudaMallocManaged() or malloc() will be what is subject to the page fault based automatic migration on the GH200?

cudaHostAlloc == pinned memory

I’m fairly certain pinned memory is not subject to demand-paged migration under any circumstance.

cudaMallocManaged is definitely the typical path to create an allocation that is demand-paged migratable (in many settings).

malloc() is typically not demand-paged migratable, but on GH200 it should be.

It looks to me like the vLLM implementation/PR did not have only GH in view, i.e. it was intended to be usable on “any” CUDA capable setup. That would be a probable motivation to use pinned memory for the objectives stated there. If they had used malloc() and then run it on a non-Grace system, it would just fail.

It may not be related to OP’s issue, but I think that HMM is enabled now on all platforms that have a compatible kernel (6.8+ iirc). We have an AMD/H100 system for which my basic HMM test works. On prior kernels it would fail.

// olmalloc_mode determines which allocator we use
// 0 is malloc (hmm only!) (default)
// 1 is cudaMallocManaged
static int olmalloc_mode = 0;
void* olmalloc(size_t bytes)
{
    void* foo = nullptr;
    if (!olmalloc_mode)
    {
        foo = malloc(bytes);
    }
    else if (olmalloc_mode == 1)
    {
        cudaMallocManaged(&foo, bytes);
    }
    else
    {
        fprintf(stderr, "invalid olmalloc mode %d\n", olmalloc_mode);
        exit(1);
    }
    if (foo == nullptr)
    {
        fprintf(stderr, "olmalloc failed\n");
        exit(1);
    }
    return foo;
}

__global__ void warmup()
{
    return;
}

__global__ void inc(int* a, size_t n)
{
    size_t i = blockDim.x * blockIdx.x + threadIdx.x;
    for (; i < n; i += blockDim.x * gridDim.x)
    {
        a[i] += 1;
    }
}

#define TPB 256
int main(int argc, char* argv[]) {
    size_t N;
    if (argc != 2 && argc != 3)
    {
        std::cerr << "Usage: " << argv[0] << " <size_t>\n";
        return 1;
    }
    try
    {
        N = std::stoul(argv[1]);
        if (argc == 3)
        {
            olmalloc_mode = std::stoi(argv[2]);
        }
    }
    catch (const std::invalid_argument& e) 
    {
        std::cerr << "Invalid argument: the input is not an unsigned integer.\n";
        return 2;
    }
    catch (const std::out_of_range& e) 
    {
        std::cerr << "Invalid argument: the input is out of range for a size_t.\n";
        return 3;
    }

    const int blocks = calculateOptimalBlocks(inc, TPB);
    int* a = (int*) olmalloc(sizeof(int) * N);

    printf("Allocating %lu bytes\n", sizeof(int) * N);
    printf("Allocating %lf gigabytes\n", sizeof(int) * N / 1e9);
    printf("Kernel Config: %d, %d\n", blocks, TPB);

    for (size_t i = 0; i < N; ++i)
    {
        a[i] = 3;
    }

    warmup<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto start = std::chrono::high_resolution_clock::now();

    inc<<<blocks, TPB>>>(a, N);
    cudaDeviceSynchronize();

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::cerr << duration.count()/1e9 << std::endl;
    std::cout << "runtime: " << duration.count()/1e9 << " seconds" << std::endl;


    for (int i = 0; i < 10; ++i)
    {
        printf("a[%d] = %d\n", i, a[i]);
    }
    for (size_t i = 0; i < N; ++i)
    {
        assert(a[i] == 4);
    }
}
tallen93@cci-hopper1:~/dev/hmm-eval/benchmarks/microbench/basic$ ./basic 50 1
Allocating 200 bytes
Allocating 0.000000 gigabytes
Kernel Config: 912, 256
0.000160561
runtime: 0.000160561 seconds
a[0] = 4
a[1] = 4
a[2] = 4
a[3] = 4
a[4] = 4
a[5] = 4
a[6] = 4
a[7] = 4
a[8] = 4
a[9] = 4
tallen93@cci-hopper1:~/dev/hmm-eval/benchmarks/microbench/basic$ ./basic 50 0
Allocating 200 bytes
Allocating 0.000000 gigabytes
Kernel Config: 912, 256
0.00033388
runtime: 0.00033388 seconds
a[0] = 4
a[1] = 4
a[2] = 4
a[3] = 4
a[4] = 4
a[5] = 4
a[6] = 4
a[7] = 4
a[8] = 4
a[9] = 4

yes, correct. With either HMM or ATS in effect, then host system ordinary allocator memory (malloc()) can be accessed from device code. ATS is the mechanism that enables Grace Hopper, and HMM may be enabled (in non-ATS cases) depending on how your system is set up.

Since this thread was about vLLM, it appears that PyTorch does not have Unified Memory support. So there is no UVM but only UVA and explicit memcopies.

It is still mysterious what the “UVM GPU BH1” process is carrying out.

There are multiple requests for UVM however there does not seem to be support yet.