I compared the memory usage of the same vLLM model on GB10 and on an x86 machine with an RTX 5090.
The test command was:
docker run --rm -e VLLM_LOGGING_LEVEL=DEBUG \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v ~/.cache/vllm:/root/.cache/vllm \
--runtime=nvidia --name=vllm \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
nvcr.io/nvidia/vllm:25.09-py3 vllm serve Qwen/Qwen3-0.6B-FP8 \
--gpu-memory-utilization 0.18 --max-model-len 40960
The --gpu-memory-utilization value (0.18 here) was tuned separately for each platform.
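For context, here is my understanding of how that flag turns into a KV-cache budget: the flag is the fraction of total GPU memory the server may use, and whatever remains after the non-KV overhead goes to the KV cache. This is a minimal sketch, not vLLM's actual accounting code; the 32 GiB figure assumes the RTX 5090's VRAM size.

```python
# Sketch: how --gpu-memory-utilization relates to the KV-cache budget.
# Assumption: the flag is a fraction of total GPU memory; the remainder
# after non-KV overhead is available for the KV cache.

def kv_cache_budget(total_vram_gib, gpu_memory_utilization, non_kv_gib):
    """GiB left for the KV cache after non-KV overhead is subtracted."""
    return total_vram_gib * gpu_memory_utilization - non_kv_gib

# RTX 5090 (32 GiB VRAM assumed), 0.18 utilization, 1.25 GiB non-KV overhead:
print(round(kv_cache_budget(32, 0.18, 1.25), 2))  # ~4.51 GiB for KV cache
```

With GB10's much larger non-KV overhead, the same utilization fraction leaves correspondingly less room for KV cache, which is why the flag had to be tuned per platform.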
Even with the same model and the same vLLM image, GB10 uses noticeably more VRAM.
The following values are from the server startup logs with VLLM_LOGGING_LEVEL=DEBUG:
GB10
- Total non-KV cache memory: 4.78 GiB
- torch peak memory increase: 0.52 GiB
- non-torch forward increase memory: 3.55 GiB
- weights memory: 0.71 GiB
x86 + RTX 5090
- Total non-KV cache memory: 1.25 GiB
- torch peak memory increase: 0.52 GiB
- non-torch forward increase memory: 0.01 GiB
- weights memory: 0.71 GiB
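A quick arithmetic check on the numbers above: the weights and torch peak figures are identical on both platforms, so the entire gap reduces to the non-torch term (the 0.01 GiB rounding mismatch is just two-decimal reporting in the logs).

```python
# Compare the two platforms' memory-profiling figures (GiB, from the logs above).
gb10 = {"total_non_kv": 4.78, "torch_peak": 0.52, "non_torch": 3.55, "weights": 0.71}
x86  = {"total_non_kv": 1.25, "torch_peak": 0.52, "non_torch": 0.01, "weights": 0.71}

total_gap     = round(gb10["total_non_kv"] - x86["total_non_kv"], 2)  # 3.53 GiB
non_torch_gap = round(gb10["non_torch"]    - x86["non_torch"],    2)  # 3.54 GiB

print(total_gap, non_torch_gap)
```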
The extra VRAM usage on GB10 comes almost entirely from the much larger “non-torch forward increase memory.”
Does anyone know what causes this extra non-torch allocation on GB10?