Running cuda-checkpoint on SGLang fails on H200, but succeeds on H100

This ticket demonstrates a use of cuda-checkpoint that fails on H200s but succeeds on H100s. In this reproduction, we attempt to run cuda-checkpoint on an SGLang server.

Driver version: 575.57.08
CUDA version: 12.9

Steps to reproduce

Step 1: Build the Dockerfile

FROM lmsysorg/sglang:v0.5.0rc2-cu126

RUN apt-get update && apt-get install -y wget && \
    wget https://github.com/NVIDIA/cuda-checkpoint/raw/refs/heads/main/bin/x86_64_Linux/cuda-checkpoint -O /usr/bin/cuda-checkpoint && \
    chmod +x /usr/bin/cuda-checkpoint

RUN pip install protobuf

Then,

docker build . --tag sglang-server

Step 2: Run the container, and run cuda-checkpoint inside

#!/bin/bash

set -ex

log() {
  echo "[$(date +"%H:%M:%S")]" "$@" >&2
}

container_id="$(sudo docker run --detach --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    --name=sglang-server \
    sglang-server \
    python3 -m sglang.launch_server \
        --model-path "mistralai/Mistral-7B-Instruct-v0.3" \
        --host 0.0.0.0 --port 30000)"

sleep 3; log "checkpointing internally"

# Query the CUDA checkpoint state of every root-owned process in the container.
get_state() {
  sudo docker exec "$container_id" \
    sh -c "ps aux | grep -v grep | grep '^root' | grep -v cuda-checkpoint | awk '{ print \$2 }' | xargs -I{} cuda-checkpoint --get-state --pid {}" || true
}

# Toggle the CUDA checkpoint state of every root-owned process in the container.
toggle_all() {
  sudo docker exec "$container_id" \
    sh -c "ps aux | grep -v grep | grep '^root' | grep -v cuda-checkpoint | awk '{ print \$2 }' | xargs -I{} cuda-checkpoint --toggle --pid {}" || true
}

get_state
toggle_all
get_state

On an H200 machine, cuda-checkpoint fails with the error "OS call failed or operation not supported on this OS" when running --get-state and --toggle on certain PIDs spawned by SGLang. On an H100 machine, the same script succeeds.
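Because the failure only affects certain PIDs, it may help to see which processes are involved. The following is a minimal sketch, not part of the original reproduction: it reads PIDs from standard input, runs the same `cuda-checkpoint --get-state --pid` call used above, and prints one tab-separated line per PID with the result and the process command line. The function name `report_states` and the `CKPT_BIN` override variable are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
# Sketch: per-PID report of cuda-checkpoint --get-state results.
# CKPT_BIN is a hypothetical override (not a cuda-checkpoint feature);
# it defaults to the cuda-checkpoint binary installed in the Dockerfile above.

report_states() {
  local ckpt="${CKPT_BIN:-cuda-checkpoint}"
  local pid out status cmd
  while read -r pid; do
    [ -n "$pid" ] || continue

    # Read the command line from /proc so the failing process can be identified.
    if [ -r "/proc/$pid/cmdline" ]; then
      cmd="$(tr '\0' ' ' < "/proc/$pid/cmdline")"
    else
      cmd="?"
    fi

    # Capture stdout and stderr; </dev/null keeps the tool from consuming the PID list.
    if out="$("$ckpt" --get-state --pid "$pid" 2>&1 </dev/null)"; then
      status="OK"
    else
      status="FAIL"
    fi

    printf '%s\t%s\t%s\t%s\n' "$pid" "$status" "$cmd" "$out"
  done
}
```

If the function is made available inside the container, it could be fed the same PID list as the script above, for example `ps -eo pid= | report_states`. Lines marked FAIL would then show which SGLang processes trigger the error on H200.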

The original ticket can be found here.