Problem Statement
- On dgx-a, an idle vLLM process consistently pins the NVIDIA GB10 GPU near its maximum graphics clock (~2411 MHz), while on dgx-b an equivalent idle vLLM setup downclocks normally (~700 MHz).
Technical Summary
- Platform: 2 DGX systems, each with an NVIDIA GB10 GPU and the same software stack.
- Workload: vLLM OpenAI server in Docker, one persistent model-serving process per node.
- Symptom:
  - dgx-a with vLLM running and no active requests: ~2405–2411 MHz, P0, ~10–11 W, near-zero utilization.
  - dgx-b with vLLM running and no active requests: ~689–721 MHz, P0, ~3–4 W, zero/near-zero utilization.
- Control check: stopping vLLM on dgx-a immediately returns idle behavior (~208 MHz, P8, ~4–5 W).
What Was Verified
- Same container image and same vLLM version on both nodes (`vllm==0.17.1`).
- Same package set/versions across both nodes (identical package fingerprint).
- Same model and same relevant vLLM env settings were tested on both nodes.
- `VLLM_SLEEP_WHEN_IDLE=1` tested; behavior unchanged on dgx-a.
- GUI/display service stop test on dgx-a did not change pinned-clock behavior.
- Container rebuild/re-pull and cache cleanup attempts did not change behavior.
- NVIDIA persistence service restart did not resolve it.
- Multiple host reboots did not resolve it.
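The post does not say how the package fingerprint was computed. As a minimal sketch of one way to get a comparable value, the script below hashes the sorted, normalized output of `pip freeze`; the function name `package_fingerprint` and the normalization rules are assumptions made for illustration, not the method actually used here.

```python
import hashlib


def package_fingerprint(freeze_lines):
    """Return a SHA-256 hex digest over a normalized, sorted package list.

    freeze_lines: iterable of strings such as "vllm==0.17.1" (for example,
    the lines printed by `pip freeze`). Blank lines and comment lines are
    ignored, and lines are lower-cased, so ordering and case do not change
    the result.
    """
    normalized = sorted(
        line.strip().lower()
        for line in freeze_lines
        if line.strip() and not line.lstrip().startswith("#")
    )
    return hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    import subprocess
    import sys

    # Run inside the container on each node, then compare the two digests.
    out = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(package_fingerprint(out.splitlines()))
```

Matching digests only show that the Python package sets agree; host driver, firmware, and kernel differences are outside what this covers.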
A/B Reproduction
- Start vLLM container on dgx-a and dgx-b with equivalent configuration.
- Wait for health check success and no pending/running requests.
- Sample clocks with `nvidia-smi dmon -s pucvmet -c 8`.
- Observe:
  - dgx-a remains near max graphics clock at idle.
  - dgx-b downclocks normally at idle.
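To make the A/B comparison easier to repeat, the sampling step can be scripted. The following is a sketch that polls `nvidia-smi --query-gpu` instead of `dmon` so the output is simple CSV. It assumes the `clocks.gr`, `pstate`, `power.draw`, and `utilization.gpu` fields are reported on GB10 (unreported fields are treated as missing), and the 2000 MHz and 5% thresholds in `looks_pinned` are illustrative values chosen from the numbers above, not official limits.

```python
import subprocess
import time

QUERY = "clocks.gr,pstate,power.draw,utilization.gpu"


def parse_sample(csv_line):
    """Parse one CSV line from nvidia-smi into a dict.

    Fields the driver does not report (printed as "[N/A]" or similar)
    become None.
    """
    def to_float(text):
        try:
            return float(text)
        except ValueError:
            return None

    clock, pstate, power, util = [f.strip() for f in csv_line.split(",")]
    return {
        "clock_mhz": to_float(clock),
        "pstate": pstate,
        "power_w": to_float(power),
        "util_pct": to_float(util),
    }


def looks_pinned(samples, clock_threshold_mhz=2000.0, util_threshold_pct=5.0):
    """True if every usable sample shows a high clock at low utilization."""
    usable = [s for s in samples if s["clock_mhz"] is not None]
    return bool(usable) and all(
        s["clock_mhz"] >= clock_threshold_mhz
        and (s["util_pct"] or 0.0) <= util_threshold_pct
        for s in usable
    )


def collect(count=8, interval_s=1.0):
    """Take `count` samples from the first GPU, one every `interval_s` seconds."""
    samples = []
    for _ in range(count):
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        samples.append(parse_sample(out.splitlines()[0]))
        time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    data = collect()
    for sample in data:
        print(sample)
    print("pinned at idle:", looks_pinned(data))
```

Running the same script on both nodes after the health check passes gives two sample sets that can be attached to the report side by side.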
Expected vs Actual
- Expected: both DGX nodes downclock similarly under idle persistent vLLM context.
- Actual: only dgx-a remains near max graphics clock under idle persistent vLLM context.
Can you help identify why dgx-a maintains near-max graphics clocks for an idle persistent compute context while dgx-b with equivalent stack and configuration downclocks normally?
Please also recommend diagnostics and mitigations for this host-specific power-state behavior.
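One host-level comparison that has not been run yet is a diff of the full `nvidia-smi -q` report from each node. The sketch below assumes each report has been saved to a text file (the file names are hypothetical) and drops only lines starting with `Timestamp`. Live readings such as temperatures and clocks will still show up as differences and need to be read past by hand. This is a diagnostic aid, not a known fix.

```python
import difflib
import itertools
import sys


def diff_reports(text_a, text_b, ignore_prefixes=("Timestamp",)):
    """Return the '-'/'+' lines that differ between two nvidia-smi -q reports.

    Blank lines and lines whose stripped text starts with one of
    ignore_prefixes are removed before comparing.
    """
    def keep(line):
        stripped = line.strip()
        return bool(stripped) and not stripped.startswith(ignore_prefixes)

    a = [line.rstrip() for line in text_a.splitlines() if keep(line)]
    b = [line.rstrip() for line in text_b.splitlines() if keep(line)]
    diff = difflib.unified_diff(
        a, b, fromfile="dgx-a", tofile="dgx-b", lineterm="", n=0
    )
    # Skip the two file-header lines, then keep only removed/added lines.
    body = itertools.islice(diff, 2, None)
    return [line for line in body if line.startswith(("-", "+"))]


if __name__ == "__main__":
    # Usage: python diff_smi.py dgx-a.txt dgx-b.txt  (hypothetical file names)
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
        for line in diff_reports(fa.read(), fb.read()):
            print(line)
```

Differences in static fields (driver, firmware, or mode settings) between dgx-a and dgx-b would be the entries worth including in a follow-up.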