When we install an LLM and start a chat session, the response speed is extremely slow.
It is very slow with both vLLM and Ollama.
The slowdown has been present since the very first chat test, and even after repeated clean installations of the OS (NVIDIA DGX OS 7) followed by retesting, performance has not improved at all.
Additionally, when running “nvidia-smi” during a chat session, GPU utilization reaches around 96%, yet the Pwr:Usage/Cap value stays at a very low “37W / N/A” and power consumption never rises above this.
This points to an additional problem.
Details are given below; from our perspective, we suspect the device is defective (an out-of-box failure).
Please arrange a prompt replacement of the unit, or, if this is not a hardware fault, provide the technical steps required to resolve it.
■ Throughput values during vLLM chat (rounded down) → Extremely slow
- vLLM: Gemma3-27b (BF16): 4.0 tokens/s
- vLLM: Gemma3-27b (FP8): 7.0 tokens/s
- vLLM: Qwen2.5-7B (BF16): 12.0 tokens/s
The vLLM Docker image used is “nvcr.io/nvidia/vllm:25.11-py3” (latest); this image should support aarch64 (it is a multi-arch image).
■ Throughput values during Ollama chat (rounded down) → Extremely slow
- Ollama: Gemma3-27b (4-bit quantized): 11.0 tokens/s
- Ollama: Llama 3.1-8B (4-bit quantized): 40.0 tokens/s
vLLM (Python/PyTorch-based) and Ollama (C++/llama.cpp-based) work in completely different ways, yet both show slow performance, which suggests a more fundamental underlying issue.
■ “nvidia-smi” output before launching the vLLM container
(Output omitted here—kept same as original)
■ “nvidia-smi” output after launching the vLLM container (before starting chat)
(Output omitted here—kept same as original)
■ “nvidia-smi” output during vLLM chat
(Output omitted here—kept same as original)
■ PCIe link status during vLLM chat
Output from:
sudo lspci -vv -s 01:00.0 | grep -E "LnkCap|LnkSta"
(Original output retained above)
When we asked Google Gemini about these results, it provided the following explanation:
- Meaning of “Width x1 (downgraded)”
  - Normal: the specification supports 16 lanes (x16), as shown in LnkCap.
  - Actual: the system is using only 1 lane (x1), explicitly flagged as “(downgraded)”.
  - Result: the available bandwidth is reduced to 1/16 of normal.
- Meaning of “Speed 2.5GT/s”
  - This corresponds to PCIe Gen1, a 20-year-old standard.
  - A modern GB10 should normally link at something like 32GT/s (Gen5).
  - Result: the speed is reduced to less than 1/10 of expected.
- Meaning of “pcilib: read failed: No such device”
  - The system cannot even read the GPU’s VPD (device identification data); the connection is barely functioning.
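As a rough cross-check of the two figures above (a back-of-the-envelope calculation, not from the original report): PCIe Gen1 delivers about 0.25 GB/s per lane (2.5 GT/s with 8b/10b encoding), while Gen5 delivers about 3.94 GB/s per lane (32 GT/s with 128b/130b encoding), so the observed Gen1 x1 link has roughly 1/250 the bandwidth of the expected Gen5 x16 link:

```python
# Back-of-the-envelope PCIe bandwidth comparison (illustrative only).
def lane_gbps(gt_per_s: float, encoding_efficiency: float) -> float:
    # Per-lane payload rate in GB/s: transfer rate * encoding efficiency / 8 bits.
    return gt_per_s * encoding_efficiency / 8

gen1_x1 = 1 * lane_gbps(2.5, 8 / 10)       # observed link: Gen1, 1 lane
gen5_x16 = 16 * lane_gbps(32.0, 128 / 130)  # expected link: Gen5, 16 lanes

print(f"Gen1 x1 : {gen1_x1:.2f} GB/s")          # 0.25 GB/s
print(f"Gen5 x16: {gen5_x16:.2f} GB/s")         # ~63 GB/s
print(f"ratio   : 1/{gen5_x16 / gen1_x1:.0f}")  # ~1/252
```

The combined lane and speed downgrade multiplies out to a two-orders-of-magnitude bandwidth loss, consistent with the severe slowdown observed.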
- Suggested action: send the unit for repair immediately. This cannot be fixed in software (Docker, BIOS, etc.); it is a physical hardware problem.

Possible causes:
- Poor contact: the GPU partially unseated, e.g. by vibration during transport
- Pin/connector damage: dirt, a bent pin, or a motherboard slot issue
- An initial hardware defect in the motherboard’s controller
-
■ Abnormal behavior during multimodal inference
When we perform image recognition with Gemma 3 27B, attaching a clear photo of an automobile and asking “What is this?”, the model responds with:
“The image shows repeated glitch-effect faces,” which is completely wrong.
Testing with the same image on a different machine produces accurate results, including correct identification of the car model.
Therefore, this is not a Gemma 3 issue.
It is likely that data corruption is occurring in either:
- the Vision Encoder’s computation, or
- the GPU tensor-transfer process.
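To help separate those two suspects, a minimal integrity check could be run on the unit (a hypothetical sketch, not part of the original report; assumes PyTorch is installed): copy a large random tensor host → GPU → host and compare it bitwise with the original. On a healthy link the round trip must be exact, so any mismatch would implicate the transfer path rather than the model:

```python
# Hypothetical host->GPU->host round-trip check (assumes PyTorch is available).
# A bitwise mismatch after the round trip would point at the tensor-transfer
# path rather than at Gemma 3's vision encoder.
import torch

def roundtrip_ok(numel: int = 1 << 24) -> bool:
    src = torch.randn(numel, dtype=torch.float32)
    if not torch.cuda.is_available():
        # No GPU in this environment: fall back to a trivial CPU copy.
        return torch.equal(src, src.clone())
    back = src.cuda().cpu()  # host -> device -> host
    return torch.equal(src, back)

if __name__ == "__main__":
    # Repeat several times; intermittent corruption may not show on one pass.
    print(all(roundtrip_ok() for _ in range(10)))
```

On the affected unit, even a single failure across repeated runs would be strong evidence of link-level corruption; a vision-encoder compute problem would instead require comparing layer outputs against a known-good machine.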