With the same model and vLLM image, GB10 uses more VRAM than x86 + GPU

I compared the memory usage of the same vLLM model on GB10 and on an x86 machine with an RTX 5090.

The test command was:

docker run --rm -e VLLM_LOGGING_LEVEL=DEBUG \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/vllm:/root/.cache/vllm \
  --runtime=nvidia --name=vllm \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  nvcr.io/nvidia/vllm:25.09-py3 vllm serve Qwen/Qwen3-0.6B-FP8 \
  --gpu-memory-utilization 0.18 --max-model-len 40960

--gpu-memory-utilization was tuned separately for each platform.

Even with the same model and the same vLLM image, GB10 reports substantially more VRAM usage.

The following values are from the server startup logs with VLLM_LOGGING_LEVEL=DEBUG:

GB10

  • Total non-KV cache memory: 4.78 GiB
  • torch peak memory increase: 0.52 GiB
  • non-torch forward increase memory: 3.55 GiB
  • weights memory: 0.71 GiB

x86 + RTX 5090

  • Total non-KV cache memory: 1.25 GiB
  • torch peak memory increase: 0.52 GiB
  • non-torch forward increase memory: 0.01 GiB
  • weights memory: 0.71 GiB

The extra VRAM usage on GB10 comes almost entirely from the much larger “non-torch forward increase memory” (3.55 GiB vs. 0.01 GiB).
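The logged values are internally consistent: total non-KV cache memory is the sum of the weights, torch peak increase, and non-torch increase. A quick check of that breakdown, using plain arithmetic on the numbers above (each component is rounded to 0.01 GiB in the logs, which is why the x86 sum lands at 1.24 rather than the logged 1.25):

```python
def total_non_kv_cache(weights_gib, torch_peak_gib, non_torch_gib):
    """Sum of the three components vLLM reports outside the KV cache."""
    return round(weights_gib + torch_peak_gib + non_torch_gib, 2)

gb10 = total_non_kv_cache(0.71, 0.52, 3.55)  # 4.78, matches the GB10 log
x86 = total_non_kv_cache(0.71, 0.52, 0.01)   # 1.24, vs. 1.25 logged (rounding)

# The cross-platform gap is almost entirely the non-torch term:
gap = round(gb10 - x86, 2)                   # 3.54 GiB
print(gb10, x86, gap)
```

Weights and torch peak are identical on both platforms, so the whole gap sits in the non-torch term.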

Does anyone know what causes this?

After tracing the code, I found that “non-torch memory” is CUDA memory allocated by components that vLLM does not track through PyTorch’s allocator. On a unified-memory (UMA) device like GB10, the GPU and CPU share one physical memory pool, so RAM used by the CPU and OS is also counted toward the “non-torch forward increase.”
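A rough sketch of that attribution logic (the function and variable names here are mine, not vLLM’s internals): total device usage comes from the driver, e.g. torch.cuda.mem_get_info(), torch’s own share comes from its caching allocator, e.g. torch.cuda.memory_reserved(), and whatever is left over gets labelled “non-torch”:

```python
def untracked_memory(device_used_bytes, torch_reserved_bytes):
    """Split driver-reported device usage into torch-tracked vs. the rest.

    device_used_bytes:    total used per the driver, e.g.
                          free, total = torch.cuda.mem_get_info()
                          device_used_bytes = total - free
    torch_reserved_bytes: what torch's caching allocator holds,
                          e.g. torch.cuda.memory_reserved()

    On a discrete GPU the remainder is the CUDA context, NCCL/cuBLAS
    workspaces, and similar. On a UMA device such as GB10, the pool the
    driver reports is the same physical memory the CPU and OS draw from,
    so their allocations inflate this remainder too.
    """
    return device_used_bytes - torch_reserved_bytes


GiB = 1024 ** 3
# Illustrative numbers only: 6 GiB used per the driver, 2 GiB held by torch
print(untracked_memory(6 * GiB, 2 * GiB) / GiB)  # -> 4.0
```

So the 3.55 GiB on GB10 is not extra memory the model needs; it is everything else sharing the unified pool being swept into vLLM’s “non-torch” bucket during profiling.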
