When we install an LLM and start a chat session, the response speed becomes extremely slow

When we install an LLM and start a chat session, the response speed becomes extremely slow.
It is very slow with both vLLM and Ollama.
The slowdown has been present from the very first chat test, and even after repeated clean installations of the OS (NVIDIA DGX OS 7) and retesting, performance has not improved at all.

Additionally, when running "nvidia-smi" during a chat session, GPU utilization reaches around 96%, yet the Pwr:Usage/Cap value stays at a very low "37W / N/A" and never rises above that.
This appears to be a separate problem.
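
For reference, the same readings can be sampled once per second with nvidia-smi's query mode (these are standard nvidia-smi query fields):

# sample GPU utilization and power draw every second during a chat session
nvidia-smi --query-gpu=utilization.gpu,power.draw,power.limit --format=csv -l 1

During chat, the utilization column climbs to around 96% while power.draw stays around 37 W, matching the summary view.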

Details are given below, but from our perspective we suspect the device is defective (an out-of-box failure).
Please arrange a prompt replacement of the unit, or, if this is not a hardware fault, provide the technical steps required to resolve it.


■ Throughput values during vLLM chat (rounded down) → Extremely slow

  • vLLM: Gemma3-27b (BF16): 4.0 tokens/s

  • vLLM: Gemma3-27b (FP8): 7.0 tokens/s

  • vLLM: Qwen2.5-7B (BF16): 12.0 tokens/s

The vLLM Docker image used is "nvcr.io/nvidia/vllm:25.11-py3" (latest). This image should support aarch64 (it is published as a multi-arch image).
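
For reference, the container was launched roughly as follows (the serve flags and model ID below are illustrative, not the exact command we used):

# illustrative launch of the NGC vLLM container; the model ID is an example (Gemma 3 27B instruct)
docker run --gpus all --rm -it -p 8000:8000 \
  nvcr.io/nvidia/vllm:25.11-py3 \
  vllm serve google/gemma-3-27b-it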


■ Throughput values during Ollama chat (rounded down) → Extremely slow

  • Ollama: Gemma3-27b (4-bit quantized): 11.0 tokens/s

  • Ollama: Llama 3.1-8B (4-bit quantized): 40.0 tokens/s
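
For reference, the per-response rate can be read directly from Ollama's verbose output, e.g.:

# --verbose makes Ollama print a timing summary after each response, including an eval rate in tokens/s
ollama run gemma3:27b --verbose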

vLLM (Python/PyTorch-based) and Ollama (C++/llama.cpp-based) work completely differently, yet both show slow performance.
This suggests a more fundamental underlying issue.


■ “nvidia-smi” output before launching the vLLM container

(output omitted)


■ “nvidia-smi” output after launching the vLLM container (before starting chat)

(output omitted)


■ “nvidia-smi” output during vLLM chat

(output omitted)


■ PCIe link status during vLLM chat

Output from:
sudo lspci -vv -s 01:00.0 | grep -E "LnkCap|LnkSta"

(output omitted)

When we asked Google Gemini about these results, it gave the following explanation:

  1. Meaning of Width x1 (downgraded)

    • Normal: The specification supports 16 lanes (x16), as shown in LnkCap.

    • Actual: The system is using only 1 lane (x1). It is explicitly recognized as “downgraded”.

    • Result: The available bandwidth is reduced to 1/16 of normal.

  2. Meaning of Speed 2.5GT/s

    • This corresponds to PCIe Gen1, a 20-year-old standard.

    • A modern GB10 should normally link at something like 32GT/s (Gen5).

    • Result: The speed is reduced to less than 1/10 of the expected rate.

  3. Meaning of "pcilib: read failed: No such device"

    • The system cannot even read the GPU's VPD (device identification). The connection is barely functioning.

  4. Suggested action: Send the unit for repair immediately.

    This cannot be fixed by software (Docker, BIOS, etc.). It is a physical hardware problem.
    Possible causes:

    • Poor contact: GPU partially unseated due to transportation vibration

    • Pin/connector damage: dirt, bent pin, motherboard slot issue

    • Initial hardware defect in the motherboard’s controller
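
Independently of Gemini's reading, the negotiated link can be checked through sysfs (assuming the GPU really is the device at 0000:01:00.0). For scale, a Gen1 x1 link carries roughly 0.25 GB/s per direction versus roughly 63 GB/s for a Gen5 x16 link.

# compare the negotiated link against the device's maximum
cat /sys/bus/pci/devices/0000:01:00.0/current_link_speed
cat /sys/bus/pci/devices/0000:01:00.0/current_link_width
cat /sys/bus/pci/devices/0000:01:00.0/max_link_speed
cat /sys/bus/pci/devices/0000:01:00.0/max_link_width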


■ Abnormal behavior during multimodal inference

When we perform image recognition with Gemma 3 27B, attaching a clear photo of a car and asking "What is this?", the model responds with:
"The image shows repeated glitch-effect faces," which is completely incorrect.

Testing with the same image on a different machine produces accurate results, including correct car model identification.
Therefore, this is not a Gemma 3 issue.
It is likely that data corruption is occurring either in:

  • the Vision Encoder’s computation, or

  • the GPU tensor transfer process.
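
For reference, the test can be reproduced against the running vLLM server with a plain OpenAI-compatible request (endpoint, model ID, and image path below are illustrative):

# car.jpg is a placeholder for the test image; localhost:8000 assumes the default vLLM port
IMG_B64=$(base64 -w0 car.jpg)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-binary @- <<EOF
{
  "model": "google/gemma-3-27b-it",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is this?"},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMG_B64}"}}
    ]
  }]
}
EOF

A healthy setup should describe the car; on this unit the response is the glitch-face hallucination described above.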

I recommend taking a look at this blog article ("using spark for inference").

The DGX Spark does not perform well with dense models like Gemma 3 27B. For the moment, you should use llama.cpp if you want the best possible t/s rate. For more details on what is possible with a Spark, have a look at:

vLLM is not yet optimized for the DGX Spark. It is still missing optimized NVFP4 kernels, which would give the Spark another boost. For the moment, AWQ still seems to perform better with vLLM.

If vision capabilities are something you need, you could give Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 or cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit a try. These are MoE models with 3B active parameters, and they perform much better on the Spark (and on other machines with unified memory).
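
A minimal launch would look roughly like this (the model shown is just the AWQ variant from above; extra flags may be needed on the Spark):

# illustrative vLLM launch; add memory/context-length flags as required for your setup
vllm serve cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit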

The FP8 and AWQ variants above are for vLLM; for llama.cpp, use a GGUF conversion in Q4_K_M or Q8_0 and test which quant gives you the best quality/speed.
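
llama-bench (which ships with llama.cpp) gives a quick tokens/s comparison between quants, for example:

# file names are placeholders for whichever GGUF quants you download
llama-bench -m gemma-3-27b-it-Q4_K_M.gguf
llama-bench -m gemma-3-27b-it-Q8_0.gguf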

Instructions on how to build llama.cpp yourself (as currently no arm64 containers are available):
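
In the meantime, a typical native CUDA build looks roughly like this (generic llama.cpp build steps, not the specific instructions referenced above; any GB10-specific tweaks would come on top):

# standard llama.cpp CUDA build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j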
