Local K8s Cluster with minikube on Nvidia DGX Spark

shahizat · December 29, 2025, 7:52pm

Hello all,

In this guide, I provide instructions for setting up a local Kubernetes environment on NVIDIA DGX Spark. We will use Minikube to deploy LLMs such as Qwen3-4B, served with the vLLM container from NGC (25.12-py3).

Install Minikube

# Download the ARM64 binary
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-arm64
sudo install minikube-linux-arm64 /usr/local/bin/minikube
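If you reuse these steps on more than one machine, it can help to derive the binary name from the architecture instead of hard-coding arm64. The following is a minimal sketch; the function name minikube_asset is made up for illustration, and it assumes only the two 64-bit Linux builds are of interest.

```shell
#!/usr/bin/env bash
# Map the machine name reported by `uname -m` to the minikube release file name.
# Hypothetical helper: only the arm64 and amd64 Linux builds are handled.
minikube_asset() {
  case "$1" in
    aarch64|arm64) echo "minikube-linux-arm64" ;;
    x86_64|amd64)  echo "minikube-linux-amd64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Usage on the target machine:
#   asset="$(minikube_asset "$(uname -m)")"
#   curl -LO "https://storage.googleapis.com/minikube/releases/latest/${asset}"
#   sudo install "${asset}" /usr/local/bin/minikube
```

On the DGX Spark, uname -m reports aarch64, so this resolves to the same arm64 binary used above.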

Install Kubectl

# Download the ARM64 binary for kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/arm64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Optional: alias kubectl to minikube's bundled kubectl (in interactive shells the alias takes precedence over the binary installed above)
echo 'alias kubectl="minikube kubectl --"' >> ~/.bashrc
source ~/.bashrc

Start Minikube using the Docker driver. The --gpus=all flag allows the cluster to access the DGX Spark GPU.

minikube start \
    --driver=docker \
    --container-runtime=docker \
    --gpus=all \
    --cpus=max \
    --memory=max
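After the cluster starts, one quick check is whether the node advertises a GPU to the scheduler at all. This is a sketch, not part of the original steps: node_gpu_count is a hypothetical helper, and it assumes a single-node cluster where the GPU shows up under the nvidia.com/gpu resource name (the same name used in the deployment manifest later in this post).

```shell
#!/usr/bin/env bash
# Print how many GPUs the first node reports as allocatable.
# Prints 0 when the nvidia.com/gpu resource is absent from the node status.
node_gpu_count() {
  local count
  count="$(kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}')"
  echo "${count:-0}"
}

# Usage:
#   node_gpu_count
```

If this prints 0, pods that request nvidia.com/gpu will stay Pending, so it is worth checking before applying any GPU workload.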

Before deploying the model, verify that the NVIDIA runtime is working within the cluster.

# Run a temporary GPU test pod
kubectl run gpu-test --image=nvidia/cuda:13.1.0-runtime-ubuntu22.04 --restart=Never -- nvidia-smi

# Check the logs to see the Blackwell (GB10) GPU status
kubectl logs gpu-test
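The logs are only complete once the pod has finished, and the test pod stays around afterwards. A small wrapper along these lines waits for completion, prints the logs, and cleans up. It is a sketch under assumptions: collect_gpu_test is a hypothetical name, the 120s timeout is arbitrary, and the jsonpath form of kubectl wait needs a reasonably recent kubectl.

```shell
#!/usr/bin/env bash
# Wait for the test pod to finish, show its output, then delete it.
collect_gpu_test() {
  local pod="${1:-gpu-test}"
  # The pod phase becomes Succeeded when nvidia-smi exits with status 0.
  if ! kubectl wait --for=jsonpath='{.status.phase}'=Succeeded "pod/${pod}" --timeout=120s; then
    echo "pod ${pod} did not reach Succeeded; inspect it with: kubectl describe pod ${pod}" >&2
    return 1
  fi
  kubectl logs "${pod}"
  kubectl delete pod "${pod}"
}

# Usage (after the `kubectl run gpu-test ...` command above):
#   collect_gpu_test
```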

Create pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-4b-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50G

Create secret.yaml. Replace YOUR_ACTUAL_TOKEN with your Hugging Face Read Token.

apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "YOUR_ACTUAL_TOKEN"

Apply these to the cluster:

kubectl apply -f pvc.yaml
kubectl apply -f secret.yaml
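As an alternative to secret.yaml, the same secret can be created straight from an environment variable, which avoids leaving the token in a file on disk. This is a sketch: create_hf_secret is a hypothetical helper, and it assumes the token has been exported as HF_TOKEN in your shell. The secret name and key match the ones referenced by the deployment.

```shell
#!/usr/bin/env bash
# Create the Hugging Face token secret from the HF_TOKEN environment variable.
create_hf_secret() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  # Same secret name (hf-token-secret) and key (token) as in secret.yaml.
  kubectl create secret generic hf-token-secret --from-literal=token="${HF_TOKEN}"
}

# Usage:
#   export HF_TOKEN=...   # your Hugging Face read token
#   create_hf_secret
```

Use either this or secret.yaml, not both; creating a secret that already exists fails.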

Deploy Qwen3-4B using vLLM

Create deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3-4b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3-4b
  template:
    metadata:
      labels:
        app: qwen3-4b
    spec:
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: qwen3-4b-storage
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
      containers:
        - name: qwen3-4b
          image: nvcr.io/nvidia/vllm:25.12-py3
          command: ["/bin/sh", "-c"]
          args: [
            "vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --trust-remote-code --max-model-len 32768 --gpu-memory-utilization 0.45"
          ]
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: cache-volume
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm

Apply the deployment:

kubectl apply -f deployment.yaml

Once the pod is running (it may take a few minutes to download the model), use port-forwarding to access the API locally. The manifests above do not create a Service, so forward to the Deployment directly.

kubectl port-forward deployment/qwen3-4b 8000:8000
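Because the model download and load can take a while, a polling loop saves you from retrying curl by hand. The sketch below is an assumption-laden helper, not part of the original steps: wait_for_url is a hypothetical name, the default of 60 attempts at 10 second intervals is arbitrary, and it probes /v1/models on the forwarded port.

```shell
#!/usr/bin/env bash
# Poll a URL until it answers with an HTTP success status or the attempts run out.
# Arguments: url, number of attempts (default 60), delay in seconds (default 10).
wait_for_url() {
  local url="$1" attempts="${2:-60}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; output is discarded.
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage, in a second terminal while the port-forward is running:
#   kubectl rollout status deployment/qwen3-4b --timeout=15m
#   wait_for_url http://localhost:8000/v1/models && echo "vLLM is ready"
```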

Test the Inference

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the Grace Blackwell architecture on DGX Spark."}
    ],
    "max_tokens": 100
  }'

shahizat · January 1, 2026, 4:45pm

Hello everyone again!

I’ve added instructions on how to run the vLLM production stack on the NVIDIA DGX Spark. The deployment includes a vLLM router and two pods (replicas), each running a vLLM instance serving Qwen3-4B, for high availability.

Install Helm using the command below:

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Clone the vLLM production-stack repository.

git clone https://github.com/vllm-project/production-stack.git
cd production-stack

Add the helm repo

helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

The default vLLM router image on Docker Hub is built for x86. For the DGX Spark, we must build a custom ARM64 image locally.

Build the router image locally on the DGX Spark. Run this command from the production-stack root:

docker build -t vllm-router-arm64:latest \
  --build-arg INSTALL_OPTIONAL_DEP=semantic_cache \
  -f docker/Dockerfile .

Once the image builds successfully, load it into your local Minikube node so it can be used in your deployment:

minikube image load vllm-router-arm64:latest

Create a custom values file named values-dgx-spark.yaml with the following content. This overrides the default images with ones that run on DGX Spark.

# values-dgx-spark.yaml

# -----------------------------------------------------------------------------
# Serving Engine Configuration (The LLM Backend)
# -----------------------------------------------------------------------------
servingEngineSpec:
  strategy:
    type: Recreate
  runtimeClassName: "" 
  modelSpec:
  - name: "qwen-3-4b"
    repository: "nvcr.io/nvidia/vllm"
    tag: "25.12-py3"
    modelURL: "Qwen/Qwen3-4B"
    replicaCount: 2
    requestCPU: 8
    requestMemory: "2Gi"

    vllmConfig:
      maxModelLen: 8192
      gpuMemoryUtilization: 0.4
      enablePrefixCaching: true
      extraArgs: ["--disable-log-requests"]

# -----------------------------------------------------------------------------
# Router Configuration
# -----------------------------------------------------------------------------

routerSpec:
  enableRouter: true
  
  repository: "vllm-router-arm64"
  tag: "latest"
  imagePullPolicy: "Never" 
  
  # Port mapping
  containerPort: 8000
  
  # Service Discovery
  serviceDiscovery: "k8s" 
  k8sServiceDiscoveryType: "pod-ip"

  # Routing logic
  routingLogic: "roundrobin"
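A quick sanity check on the values above: gpuMemoryUtilization is a per-instance fraction of GPU memory, so with two replicas the fractions add up. The sketch below assumes both replicas land on the same single GPU of the DGX Spark; gpu_budget is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Multiply replica count by the per-replica GPU memory fraction and
# report whether the total stays within one GPU.
# Arguments: replicaCount, gpuMemoryUtilization.
gpu_budget() {
  awk -v r="$1" -v u="$2" 'BEGIN {
    total = r * u
    printf "%.2f %s\n", total, (total <= 1.0 ? "fits" : "oversubscribed")
  }'
}

# Usage with the values from this file:
#   gpu_budget 2 0.4
```

With replicaCount 2 and gpuMemoryUtilization 0.4 the total is 0.80, which leaves some headroom; raising either value is what would push it past 1.0.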

Install the chart with the custom values file:

helm install spark-vllm vllm/vllm-stack -f values-dgx-spark.yaml

This takes some time on the first run, because it has to fetch both the model weights and the vLLM NGC Docker image.

Verify the deployment:

kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
spark-vllm-deployment-router-6f977d446d-pnhmk           1/1     Running   0          37m
spark-vllm-qwen-3-4b-deployment-vllm-7499f9c8b8-glc96   1/1     Running   0          37m
spark-vllm-qwen-3-4b-deployment-vllm-7499f9c8b8-h7d78   1/1     Running   0          37m
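If you want to script this check rather than read the table by eye, the output can be filtered with awk. A minimal sketch, assuming the default column layout shown above (NAME, READY, STATUS, ...); count_ready_pods is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read `kubectl get pods` output on stdin and count pods whose READY column
# shows all containers ready (for example 1/1) and whose STATUS is Running.
count_ready_pods() {
  awk 'NR > 1 {
         split($2, r, "/")
         if (r[1] == r[2] && $3 == "Running") n++
       }
       END { print n + 0 }'
}

# Usage:
#   kubectl get pods | count_ready_pods
```

For the deployment in this post, the expected count is 3: one router pod and two vLLM pods.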

Expose the port for the service:

kubectl port-forward svc/spark-vllm-router-service 30080:80

Check model availability:

curl -o- http://localhost:30080/v1/models
{"object":"list","data":[{"id":"Qwen/Qwen3-4B","object":"model","created":1767283942,"owned_by":"vllm","root":null,"parent":null}]}

Send a chat completion request. The router will balance requests between your two Qwen3 instances:

curl http://localhost:30080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    "temperature": 0.7
  }'

{"id":"chatcmpl-24057c15171943a5b18ff52e47b558cb","object":"chat.completion","created":1767284288,"model":"Qwen/Qwen3-4B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, the user wants a one-sentence explanation of quantum computing. Let me start by recalling what I know. Quantum computing uses qubits, which can be in superposition, so they can represent 0 and 1 simultaneously. That's different from classical bits. Also, entanglement is a key aspect, where qubits are linked and the state of one affects the other. Then there's quantum interference, which helps in amplifying correct answers and canceling out wrong ones. The goal is to solve certain problems much faster than classical computers, like factoring large numbers or simulating molecules. But I need to keep it concise. Maybe start with \"Quantum computing leverages...\" and mention qubits, superposition, entanglement, and the potential for exponential speedup. But how to fit all that into one sentence without being too technical? Maybe something like: \"Quantum computing uses qubits that exist in superpositions of states and entanglement to perform complex calculations exponentially faster than classical computers for specific problems.\" Wait, does that cover the main points? Maybe mention the purpose, like solving certain problems efficiently. Let me check if that's accurate. Yeah, that seems right. But maybe \"exponential speedup\" is important. Also, mention that it's for specific problems, since not all problems benefit from it. 
Okay, that should work.\n</think>\n\nQuantum computing leverages qubits that exist in superpositions of states and entanglement to perform complex calculations exponentially faster than classical computers for specific problems like factoring large numbers or simulating molecules.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":16,"total_tokens":338,"completion_tokens":322,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
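The response above includes the model's think block, which accounts for most of the 322 completion tokens. Qwen3's chat template has an enable_thinking switch that vLLM accepts per request through chat_template_kwargs. The sketch below builds such a request body; chat_body_no_think is a hypothetical helper, and it is an assumption that the router and this NGC build forward the field unchanged.

```shell
#!/usr/bin/env bash
# Build a chat request body that asks the Qwen3 chat template to skip thinking.
# Arguments: model name, user prompt.
# Assumption: the prompt contains no double quotes or backslashes,
# because no JSON escaping is done here.
chat_body_no_think() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "chat_template_kwargs": {"enable_thinking": false}}\n' "$1" "$2"
}

# Usage:
#   curl http://localhost:30080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_body_no_think Qwen/Qwen3-4B 'Explain quantum computing in one sentence.')"
```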

shahizat · January 2, 2026, 8:33am

This guide demonstrates how to implement the vLLM Observability Stack on a DGX Spark using Prometheus and Grafana. These modules gather critical metrics such as TTFT (Time-To-First-Token), ITL (Inter-Token Latency), and Throughput, providing real-time insights into your model’s performance.

Navigate to the observability directory and execute the installation script. This script deploys the kube-prometheus-stack, which includes Prometheus, Grafana, and the necessary operators.

cd observability
bash install.sh

Output

NAME: kube-prom-stack
LAST DEPLOYED: Thu Jan  1 23:27:27 2026
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace monitoring get pods -l "release=kube-prom-stack"

Get Grafana 'admin' user password by running:

  kubectl --namespace monitoring get secrets kube-prom-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo

Access Grafana local instance:

  export POD_NAME=$(kubectl --namespace monitoring get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prom-stack" -oname)
  kubectl --namespace monitoring port-forward $POD_NAME 3000

Get your grafana admin user password by running:

  kubectl get secret --namespace monitoring -l app.kubernetes.io/component=admin-secret -o jsonpath="{.items[0].data.admin-password}" | base64 --decode ; echo


Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
Release "prometheus-adapter" does not exist. Installing it now.
NAME: prometheus-adapter
LAST DEPLOYED: Thu Jan  1 23:28:42 2026
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
prometheus-adapter has been deployed.
In a few minutes you should be able to list metrics using the following command(s):

  kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1

Run this command to map the Grafana service to port 3000 on the DGX Spark:

kubectl port-forward -n monitoring svc/kube-prom-stack-grafana 3000:80 --address 0.0.0.0

Establish an SSH Tunnel from your Remote PC

ssh -L 3000:localhost:3000 spark@<DGX-IP-ADDRESS>

Open your browser on your Remote PC. Navigate to: http://127.0.0.1:3000

User: admin
Password: check your values.yaml (likely prom-operator), or read it with the kubectl get secrets command printed in the installer output above

Import the vLLM Dashboard

In Grafana, go to Dashboards → Import.
Upload the vllm-dashboard.json file found in your observability folder (production-stack/observability/vllm-dashboard.json on the main branch of the vllm-project/production-stack GitHub repository).
Select Prometheus as the data source when prompted.

Open vLLM dashboard

To verify the observability stack is working, run a benchmark to generate traffic. This test sends 1,000 requests with a concurrency of 100 to the Qwen3 model.

vllm bench serve \
    --backend vllm \
    --base-url http://localhost:30080 \
    --model Qwen/Qwen3-4B \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 128 \
    --num-prompts 1000 \
    --max-concurrency 100 \
    --temperature 0.7

Output of bench:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             100       
Benchmark duration (s):                  175.69    
Total input tokens:                      128000    
Total generated tokens:                  128000    
Request throughput (req/s):              5.69      
Output token throughput (tok/s):         728.54    
Peak output token throughput (tok/s):    850.00    
Peak concurrent requests:                195.00    
Total token throughput (tok/s):          1457.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          814.82    
Median TTFT (ms):                        750.91    
P99 TTFT (ms):                           1769.83   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          131.70    
Median TPOT (ms):                        131.66    
P99 TPOT (ms):                           135.67    
---------------Inter-token Latency----------------
Mean ITL (ms):                           131.70    
Median ITL (ms):                         126.25    
P99 ITL (ms):                            421.22    
==================================================
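The throughput lines in the report follow directly from the totals and the duration, which makes them easy to recompute when comparing runs. A minimal sketch; bench_rates is a hypothetical helper, and the token rates are rounded to whole numbers because the printed duration is itself rounded.

```shell
#!/usr/bin/env bash
# Recompute throughput figures from the totals in a benchmark report.
# Arguments: requests, duration in seconds, input tokens, output tokens.
bench_rates() {
  awk -v n="$1" -v d="$2" -v tin="$3" -v tout="$4" 'BEGIN {
    printf "req/s=%.2f out_tok/s=%.0f total_tok/s=%.0f\n", n / d, tout / d, (tin + tout) / d
  }'
}

# Usage with the figures from the report above:
#   bench_rates 1000 175.69 128000 128000
```

For the run above this gives 5.69 req/s, about 729 output tokens per second, and about 1457 total tokens per second, in line with the reported values.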

If your dashboard shows “No Data” for GPU Cache Usage or GPU KV Cache Hit Rate, it is likely due to the nvidia-device-plugin crashing. The device plugin does not appear to be officially supported on DGX Spark; see issue #1482, “does device plugin support GB10 (NVIDIA DGX Spark)”, in the NVIDIA/k8s-device-plugin GitHub repository.

raphael.amorim · January 2, 2026, 5:00pm

Thanks for sharing @shahizat! This is great
