When we install an LLM and start a chat session, the response speed becomes extremely slow

When we install an LLM and start a chat session, the response speed becomes extremely slow.
It is very slow with both vLLM and Ollama.
The slowdown has been present from the very first chat test, and even after repeated clean installations of the OS (NVIDIA DGX OS 7) and retesting, performance has not improved at all.

Additionally, when running "nvidia-smi" during a chat session, GPU utilization reaches around 96%, yet the Pwr:Usage/Cap value stays at a very low "37W / N/A" and never rises above that.
This appears to be a separate problem.
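
For reference, the same readings can be sampled once per second with nvidia-smi's query mode (these are standard nvidia-smi query fields):

# sample GPU utilization and power draw every second during a chat session
nvidia-smi --query-gpu=utilization.gpu,power.draw,power.limit --format=csv -l 1

During chat, the utilization column climbs to around 96% while power.draw stays around 37 W, matching the summary view.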

Details are given below, but from our perspective we suspect the device is defective (an out-of-box failure).
Please arrange a prompt replacement of the unit, or, if this is not a hardware fault, provide the technical steps required to resolve it.


■ Throughput values during vLLM chat (rounded down) → Extremely slow

  • vLLM: Gemma3-27b (BF16): 4.0 tokens/s

  • vLLM: Gemma3-27b (FP8): 7.0 tokens/s

  • vLLM: Qwen2.5-7B (BF16): 12.0 tokens/s

The vLLM Docker image used is "nvcr.io/nvidia/vllm:25.11-py3" (latest). This image should support aarch64 (it is published as a multi-arch image).
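
For reference, the container was launched roughly as follows (the serve flags and model ID below are illustrative, not the exact command we used):

# illustrative launch of the NGC vLLM container; the model ID is an example (Gemma 3 27B instruct)
docker run --gpus all --rm -it -p 8000:8000 \
  nvcr.io/nvidia/vllm:25.11-py3 \
  vllm serve google/gemma-3-27b-it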


■ Throughput values during Ollama chat (rounded down) → Extremely slow

  • Ollama: Gemma3-27b (4-bit quantized): 11.0 tokens/s

  • Ollama: Llama 3.1-8B (4-bit quantized): 40.0 tokens/s
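
For reference, the per-response rate can be read directly from Ollama's verbose output, e.g.:

# --verbose makes Ollama print a timing summary after each response, including an eval rate in tokens/s
ollama run gemma3:27b --verbose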

vLLM (Python/PyTorch-based) and Ollama (C++/llama.cpp-based) work completely differently, yet both show slow performance.
This suggests a more fundamental underlying issue.


■ “nvidia-smi” output before launching the vLLM container

(output omitted)


■ “nvidia-smi” output after launching the vLLM container (before starting chat)

(output omitted)


■ “nvidia-smi” output during vLLM chat

(output omitted)


■ PCIe link status during vLLM chat

Output from:
sudo lspci -vv -s 01:00.0 | grep -E "LnkCap|LnkSta"

(output omitted)

When we asked Google Gemini about these results, it gave the following explanation:

  1. Meaning of Width x1 (downgraded)

    • Normal: The specification supports 16 lanes (x16), as shown in LnkCap.

    • Actual: The system is using only 1 lane (x1). It is explicitly recognized as “downgraded”.

    • Result: The available bandwidth is reduced to 1/16 of normal.

  2. Meaning of Speed 2.5GT/s

    • This corresponds to PCIe Gen1, a 20-year-old standard.

    • A modern GB10 should normally link at something like 32GT/s (Gen5).

    • Result: The speed is reduced to less than 1/10 of the expected rate.

  3. Meaning of "pcilib: read failed: No such device"

    • The system cannot even read the GPU's VPD (device identification). The connection is barely functioning.

  4. Suggested action: Send the unit for repair immediately.

    This cannot be fixed by software (Docker, BIOS, etc.). It is a physical hardware problem.
    Possible causes:

    • Poor contact: GPU partially unseated due to transportation vibration

    • Pin/connector damage: dirt, bent pin, motherboard slot issue

    • Initial hardware defect in the motherboard’s controller
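
Independently of Gemini's reading, the negotiated link can be checked through sysfs (assuming the GPU really is the device at 0000:01:00.0). For scale, a Gen1 x1 link carries roughly 0.25 GB/s per direction versus roughly 63 GB/s for a Gen5 x16 link.

# compare the negotiated link against the device's maximum
cat /sys/bus/pci/devices/0000:01:00.0/current_link_speed
cat /sys/bus/pci/devices/0000:01:00.0/current_link_width
cat /sys/bus/pci/devices/0000:01:00.0/max_link_speed
cat /sys/bus/pci/devices/0000:01:00.0/max_link_width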


■ Abnormal behavior during multimodal inference

When we perform image recognition with Gemma 3 27B, attaching a clear photo of a car and asking "What is this?", the model responds with:
"The image shows repeated glitch-effect faces," which is completely incorrect.

Testing with the same image on a different machine produces accurate results, including correct car model identification.
Therefore, this is not a Gemma 3 issue.
It is likely that data corruption is occurring either in:

  • the Vision Encoder’s computation, or

  • the GPU tensor transfer process.
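
For reference, the test can be reproduced against the running vLLM server with a plain OpenAI-compatible request (endpoint, model ID, and image path below are illustrative):

# car.jpg is a placeholder for the test image; localhost:8000 assumes the default vLLM port
IMG_B64=$(base64 -w0 car.jpg)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-binary @- <<EOF
{
  "model": "google/gemma-3-27b-it",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is this?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMG_B64}"}}
    ]
  }]
}
EOF

A healthy setup should describe the car; on this unit the response is the glitch-face hallucination described above.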

I recommend taking a look at this blog article ("using spark for inference").

The DGX Spark does not perform well with dense models like Gemma 3 27B. For the moment, you should use llama.cpp if you want the best possible t/s rate. For more details on what is possible with a Spark, have a look at:

vLLM is not yet optimized for the DGX Spark. It is still missing optimized NVFP4 kernels, which would give the Spark another boost. For the moment, AWQ still seems to perform better with vLLM.

If vision capabilities are something you need, you could give Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 or cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit a try. These are MoE models with 3B active parameters, and they perform much better on the Spark (and on other machines with unified memory).
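
A minimal launch would look roughly like this (the model shown is just the AWQ variant from above; extra flags may be needed on the Spark):

# illustrative vLLM launch; add memory/context-length flags as required for your setup
vllm serve cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit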

The FP8 and AWQ variants above are for vLLM; for llama.cpp, use a GGUF conversion in Q4_K_M or Q8_0 and test which quant gives you the best quality/speed.
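
llama-bench (which ships with llama.cpp) gives a quick tokens/s comparison between quants, for example:

# file names are placeholders for whichever GGUF quants you download
llama-bench -m gemma-3-27b-it-Q4_K_M.gguf
llama-bench -m gemma-3-27b-it-Q8_0.gguf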

Instructions on how to build llama.cpp yourself (as currently no arm64 containers are available):
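
In the meantime, a typical native CUDA build looks roughly like this (generic llama.cpp build steps, not the specific instructions referenced above; any GB10-specific tweaks would come on top):

# standard llama.cpp CUDA build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j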
