Running cuda-checkpoint on SGLang fails on H200, but succeeds on H100

matt452 · September 23, 2025, 7:26pm

This ticket demonstrates a use of cuda-checkpoint that fails on H200s but succeeds on H100s. In this reproduction, we attempt to run cuda-checkpoint on an SGLang server.

Driver version: 575.57.08
CUDA version: 12.9
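
When comparing the H100 and H200 hosts, it may help to record the GPU model and driver version in the same format on both. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the function name gpu_env is hypothetical and not part of any tool.

```shell
#!/bin/bash
# Hypothetical helper: print one "name, driver_version" line per GPU,
# using nvidia-smi's CSV query output without the header row.
gpu_env() {
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}
```

Calling gpu_env on each host gives one line per GPU that can be pasted next to the versions listed above.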

Steps to reproduce

Step 1: Build the Dockerfile

FROM lmsysorg/sglang:v0.5.0rc2-cu126

RUN apt-get update && apt-get install -y wget && \
    wget https://github.com/NVIDIA/cuda-checkpoint/raw/refs/heads/main/bin/x86_64_Linux/cuda-checkpoint -O /usr/bin/cuda-checkpoint && \
    chmod +x /usr/bin/cuda-checkpoint

RUN pip install protobuf

Then build the image:

docker build . --tag sglang-server

Step 2: Run the container, and run cuda-checkpoint inside

#!/bin/bash

set -ex

log() {
  echo "[$(date +"%H:%M:%S")]" "$@" >&2
}

container_id="$(sudo docker run --detach --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    --name=sglang-server \
    sglang-server \
    python3 -m sglang.launch_server \
        --model-path "mistralai/Mistral-7B-Instruct-v0.3" \
        --host 0.0.0.0 --port 30000)"

sleep 3; log "checkpointing internally"

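# Print the CUDA state of every root-owned process in the container
# (excluding grep and cuda-checkpoint itself). "|| true" keeps "set -e"
# from aborting the script when cuda-checkpoint fails on a PID.
get_state() {
  sudo docker exec "$container_id" \
    sh -c "ps aux | grep -v grep | grep '^root' | grep -v cuda-checkpoint | awk '{ print \$2 }' | xargs -I{} cuda-checkpoint --get-state --pid {}" || true
}

# Same PID selection as get_state, but toggle the checkpoint state of each process.
toggle_all() {
  sudo docker exec "$container_id" \
    sh -c "ps aux | grep -v grep | grep '^root' | grep -v cuda-checkpoint | awk '{ print \$2 }' | xargs -I{} cuda-checkpoint --toggle --pid {}" || true
}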

get_state
toggle_all
get_state

On an H200 machine, cuda-checkpoint fails with "OS call failed or operation not supported on this OS" when running --get-state and --toggle on certain PIDs spawned by SGLang. On an H100 machine, the same commands succeed.
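
To narrow down which PIDs fail, the per-process results could be printed one per line. This is a minimal sketch rather than part of the original reproduction: report_states and the CUDA_CHECKPOINT override variable are hypothetical names, and only the --get-state and --pid flags already used above are assumed. It is meant to run inside the container, where cuda-checkpoint is installed.

```shell
#!/bin/bash
# Hypothetical helper: for each PID argument, run `cuda-checkpoint --get-state`
# and print the PID, process name, exit status, and combined stdout/stderr.
# CUDA_CHECKPOINT is an assumed override for the binary path (not a tool feature).
report_states() {
  local bin="${CUDA_CHECKPOINT:-cuda-checkpoint}"
  local pid comm out rc
  for pid in "$@"; do
    # Process name from /proc; "?" if the process is gone or unreadable.
    comm="$(cat "/proc/$pid/comm" 2>/dev/null || echo '?')"
    rc=0
    out="$("$bin" --get-state --pid "$pid" 2>&1)" || rc=$?
    printf 'pid=%s comm=%s rc=%s output=%s\n' "$pid" "$comm" "$rc" "$out"
  done
}
```

Inside the container, it could be invoked as `report_states $(pgrep -u root)`, assuming pgrep is available. Lines with a non-zero rc identify the processes for which cuda-checkpoint reports the error.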

The original ticket can be found here.
