This ticket demonstrates a use of cuda-checkpoint that fails on H200s but succeeds on H100s. In this reproduction, we attempt to run cuda-checkpoint on an SGLang server.
Driver version: 575.57.08
CUDA version: 12.9
Steps to reproduce
Step 1: Build the Dockerfile
FROM lmsysorg/sglang:v0.5.0rc2-cu126
RUN apt-get update && apt-get install -y wget && \
    wget https://github.com/NVIDIA/cuda-checkpoint/raw/refs/heads/main/bin/x86_64_Linux/cuda-checkpoint -O /usr/bin/cuda-checkpoint && \
    chmod +x /usr/bin/cuda-checkpoint
RUN pip install protobuf
Then build the image:
docker build . --tag sglang-server
Step 2: Run the container, and run cuda-checkpoint inside
#!/bin/bash
set -ex
log() {
    echo "[$(date +"%H:%M:%S")]" "$@" >&2
}
container_id="$(sudo docker run --detach --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    --name=sglang-server \
    sglang-server \
    python3 -m sglang.launch_server \
    --model-path "mistralai/Mistral-7B-Instruct-v0.3" \
    --host 0.0.0.0 --port 30000)"
sleep 3; log "checkpointing internally"
# Query the CUDA state of every root-owned process in the container.
get_state() {
    sudo docker exec "$container_id" \
        sh -c "ps aux | grep -v grep | grep '^root' | grep -v cuda-checkpoint | awk '{ print \$2 }' | xargs -I{} cuda-checkpoint --get-state --pid {}" || true
}
# Toggle the CUDA state of every root-owned process in the container.
toggle_all() {
    sudo docker exec "$container_id" \
        sh -c "ps aux | grep -v grep | grep '^root' | grep -v cuda-checkpoint | awk '{ print \$2 }' | xargs -I{} cuda-checkpoint --toggle --pid {}" || true
}
get_state
toggle_all
get_state
On an H200 machine, cuda-checkpoint fails with "OS call failed or operation not supported on this OS" when running --get-state and --toggle on certain PIDs spawned by SGLang. On an H100 machine, this command succeeds.
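Because the failure only affects certain PIDs, it may help to query each PID separately and print the result next to the PID, rather than piping everything through xargs. The following is a minimal sketch, not part of the original reproduction. The function name report_states and the CKPT_BIN override are hypothetical, introduced here for illustration; the only cuda-checkpoint flags used are --get-state and --pid, as in the script above.

```shell
#!/bin/bash
# Hypothetical helper: query the CUDA state of each PID on its own, so a
# failure can be attributed to a specific process.
# CKPT_BIN is an assumed override (useful for testing with a stub);
# it defaults to the real cuda-checkpoint binary.
CKPT_BIN="${CKPT_BIN:-cuda-checkpoint}"

report_states() {
    local pid state
    for pid in "$@"; do
        # Capture stdout and stderr together so any error text stays with its PID.
        if state="$("$CKPT_BIN" --get-state --pid "$pid" 2>&1)"; then
            echo "pid=$pid ok: $state"
        else
            echo "pid=$pid FAILED: $state"
        fi
    done
}
```

Inside the container, this could be called as report_states $(pgrep -u root), assuming pgrep is available in the image. Cross-referencing the failing PIDs with ps output would then show which SGLang processes are affected.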
The original ticket can be found here.