An idle vLLM process consistently pins the NVIDIA GB10 GPU at max graphics clock

Problem Statement

  • On dgx-a, an idle vLLM process consistently pins the NVIDIA GB10 GPU near maximum graphics clock (~2411 MHz), while on dgx-b an equivalent idle vLLM setup downclocks normally (~700 MHz).

Technical Summary

  • Platform: two DGX systems (dgx-a and dgx-b), each with an NVIDIA GB10 GPU and the same software stack.

  • Workload: vLLM OpenAI server in Docker, one persistent model-serving process per node.

  • Symptom:

  1. dgx-a with vLLM running and no active requests: ~2405–2411 MHz, P0, ~10–11 W, near-zero utilization.

  2. dgx-b with vLLM running and no active requests: ~689–721 MHz, P0, ~3–4 W, zero/near-zero utilization.

  • Control check: stopping vLLM on dgx-a immediately restores normal idle behavior (~208 MHz, P8, ~4–5 W).
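
For anyone who wants to reproduce the numbers above without reading dmon columns by eye, here is a minimal Python sketch that takes one sample through nvidia-smi's --query-gpu interface. The function names and the dictionary keys are my own, not part of any tool, and some fields may come back as [N/A] on GB10, which the parser maps to None.

```python
import subprocess

# Query fields understood by `nvidia-smi --query-gpu`.
FIELDS = ["clocks.gr", "pstate", "power.draw", "utilization.gpu"]


def _to_float(value):
    """Return a float, or None for non-numeric values such as '[N/A]'."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_sample(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {line!r}")
    return {
        "graphics_mhz": _to_float(parts[0]),
        "pstate": parts[1],
        "power_w": _to_float(parts[2]),
        "util_pct": _to_float(parts[3]),
    }


def sample_gpu(index=0):
    """Run nvidia-smi once for the given GPU index and return a parsed sample."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={index}",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sample(result.stdout.strip().splitlines()[0])
```

Calling sample_gpu() on each host with vLLM idle, and again on dgx-a after stopping vLLM, should give directly comparable records for the three cases listed above.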

What Was Verified

  • Same container image and same vLLM version on both nodes (vllm==0.17.1).

  • Same package set/versions across both nodes (identical package fingerprint).

  • Same model and same relevant vLLM env settings were tested on both nodes.

  • VLLM_SLEEP_WHEN_IDLE=1 tested; behavior unchanged on dgx-a.

  • Stopping the GUI/display service on dgx-a did not change the pinned-clock behavior.

  • Container rebuild/re-pull and cache cleanup attempts did not change behavior.

  • NVIDIA persistence service restart did not resolve it.

  • Multiple host reboots did not resolve it.
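
In case it helps others repeat the package comparison, this is one way a package fingerprint could be computed inside each container. It is only a sketch of the idea, not necessarily the exact method used for the check above: it hashes the sorted name==version list of installed Python distributions, and the function name is hypothetical.

```python
import hashlib
from importlib.metadata import distributions


def package_fingerprint(pairs=None):
    """Return a SHA-256 hex digest over sorted 'name==version' lines.

    `pairs` is an optional iterable of (name, version) tuples; when omitted,
    the distributions installed in the current environment are used.
    """
    if pairs is None:
        pairs = [(d.metadata["Name"], d.version) for d in distributions()]
    # Lower-case names and sort so the digest does not depend on listing order.
    lines = sorted(f"{(name or '').lower()}=={version}" for name, version in pairs)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
```

Running package_fingerprint() in the container on both nodes and comparing the two digests covers Python packages only; it says nothing about the driver, firmware, or kernel on the host.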

A/B Reproduction

  1. Start vLLM container on dgx-a and dgx-b with equivalent configuration.

  2. Wait for health check success and no pending/running requests.

  3. Sample clocks with nvidia-smi dmon -s pucvmet -c 8.

  4. Observe:

  • dgx-a remains near max graphics clock at idle.

  • dgx-b downclocks normally at idle.
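
Step 2 can be checked programmatically rather than by eye. The sketch below assumes the vLLM OpenAI server is reachable at http://localhost:8000 (an assumption, adjust to your port mapping) and uses its /health endpoint plus the vllm:num_requests_running and vllm:num_requests_waiting gauges from /metrics. The helper names are hypothetical.

```python
import re
import urllib.request

# Matches Prometheus sample lines for the two request gauges, with or without labels.
_METRIC = re.compile(
    r"^(vllm:num_requests_(?:running|waiting))(?:\{.*\})?\s+(\S+)\s*$"
)


def requests_in_flight(metrics_text):
    """Sum running and waiting request gauges found in a /metrics payload.

    Note: returns 0.0 if the gauges are absent, so an empty payload looks idle.
    """
    total = 0.0
    for line in metrics_text.splitlines():
        match = _METRIC.match(line)
        if match:
            total += float(match.group(2))
    return total


def is_idle(base_url="http://localhost:8000", timeout=5.0):
    """Return True if the server is healthy and reports no running or waiting requests."""
    # urlopen raises on non-2xx responses, so reaching the check means the call succeeded.
    with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
        if resp.status != 200:
            return False
    with urllib.request.urlopen(base_url + "/metrics", timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return requests_in_flight(text) == 0
```

Polling is_idle() on each node until it returns True before starting the nvidia-smi dmon capture keeps the two measurements comparable.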

Expected vs Actual

  • Expected: both DGX nodes downclock similarly with an idle, persistent vLLM context.

  • Actual: only dgx-a remains near max graphics clock with an idle, persistent vLLM context.
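
To put a number on the expected-vs-actual gap, the dmon capture from step 3 can be saved to a file on each host and reduced to a mean graphics clock. This is a rough sketch: it reads column names from the first "#" header line and assumes the graphics clock column is labeled pclk, which may differ between nvidia-smi builds. Function names are hypothetical.

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon` text output into a list of {column: value} dicts."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first '#' line carries column names; later ones (units) are skipped.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        if columns is None:
            continue
        values = line.split()
        if len(values) == len(columns):
            rows.append(dict(zip(columns, values)))
    return rows


def mean_column(rows, name):
    """Average a numeric column, ignoring missing or non-numeric cells such as '-'."""
    numbers = []
    for row in rows:
        try:
            numbers.append(float(row[name]))
        except (KeyError, ValueError):
            continue
    return sum(numbers) / len(numbers) if numbers else None
```

For example, mean_column(parse_dmon(open("dgx-a.txt").read()), "pclk") and the same call for dgx-b give one figure per host to quote in follow-ups (the file names are placeholders).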

Can you help identify why dgx-a maintains near-max graphics clocks for an idle persistent compute context while dgx-b with equivalent stack and configuration downclocks normally?

Can you also recommend diagnostics and mitigations for this host-specific power-state behavior?