With the same model and vLLM image, GB10 uses more VRAM than x86 + GPU

I compared the memory usage of the same vLLM model on GB10 and on an x86 machine with an RTX 5090.

The test command was:

docker run --rm -e VLLM_LOGGING_LEVEL=DEBUG \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/vllm:/root/.cache/vllm \
  --runtime=nvidia --name=vllm \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  nvcr.io/nvidia/vllm:25.09-py3 vllm serve Qwen/Qwen3-0.6B-FP8 \
  --gpu-memory-utilization 0.18 --max-model-len 40960

--gpu-memory-utilization was tuned separately for each platform.

Even with the same model and the same vLLM image, GB10 reports substantially more VRAM usage.

The following values are from the server startup logs with VLLM_LOGGING_LEVEL=DEBUG:

GB10

  • Total non-KV cache memory: 4.78 GiB
  • torch peak memory increase: 0.52 GiB
  • non-torch forward increase memory: 3.55 GiB
  • weights memory: 0.71 GiB

x86 + RTX 5090

  • Total non-KV cache memory: 1.25 GiB
  • torch peak memory increase: 0.52 GiB
  • non-torch forward increase memory: 0.01 GiB
  • weights memory: 0.71 GiB

The extra VRAM usage on GB10 comes almost entirely from the much larger “non-torch forward increase memory” (3.55 GiB vs. 0.01 GiB).
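The logged values are internally consistent: total non-KV cache memory is the sum of the weights, torch peak increase, and non-torch increase. A quick check of that breakdown, using plain arithmetic on the numbers above (each component is rounded to 0.01 GiB in the logs, which is why the x86 sum lands at 1.24 rather than the logged 1.25):

```python
def total_non_kv_cache(weights_gib, torch_peak_gib, non_torch_gib):
    """Sum of the three components vLLM reports outside the KV cache."""
    return round(weights_gib + torch_peak_gib + non_torch_gib, 2)

gb10 = total_non_kv_cache(0.71, 0.52, 3.55)  # 4.78, matches the GB10 log
x86 = total_non_kv_cache(0.71, 0.52, 0.01)   # 1.24, vs. 1.25 logged (rounding)

# The cross-platform gap is almost entirely the non-torch term:
gap = round(gb10 - x86, 2)                   # 3.54 GiB
print(gb10, x86, gap)
```

Weights and torch peak are identical on both platforms, so the whole gap sits in the non-torch term.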

Does anyone know what causes this?

After tracing the code, I found that “non-torch memory” is CUDA memory allocated by components that vLLM does not track through PyTorch’s allocator. On a unified-memory (UMA) device like GB10, the GPU and CPU share one physical memory pool, so RAM used by the CPU and OS is also counted toward the “non-torch forward increase.”
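A rough sketch of that attribution logic (the function and variable names here are mine, not vLLM’s internals): total device usage comes from the driver, e.g. torch.cuda.mem_get_info(), torch’s own share comes from its caching allocator, e.g. torch.cuda.memory_reserved(), and whatever is left over gets labelled “non-torch”:

```python
def untracked_memory(device_used_bytes, torch_reserved_bytes):
    """Split driver-reported device usage into torch-tracked vs. the rest.

    device_used_bytes:    total used per the driver, e.g.
                          free, total = torch.cuda.mem_get_info()
                          device_used_bytes = total - free
    torch_reserved_bytes: what torch's caching allocator holds,
                          e.g. torch.cuda.memory_reserved()

    On a discrete GPU the remainder is the CUDA context, NCCL/cuBLAS
    workspaces, and similar. On a UMA device such as GB10, the pool the
    driver reports is the same physical memory the CPU and OS draw from,
    so their allocations inflate this remainder too.
    """
    return device_used_bytes - torch_reserved_bytes


GiB = 1024 ** 3
# Illustrative numbers only: 6 GiB used per the driver, 2 GiB held by torch
print(untracked_memory(6 * GiB, 2 * GiB) / GiB)  # -> 4.0
```

So the 3.55 GiB on GB10 is not extra memory the model needs; it is everything else sharing the unified pool being swept into vLLM’s “non-torch” bucket during profiling.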
