Local K8s Cluster with Minikube on NVIDIA DGX Spark

Hello all,

This guide sets up a local Kubernetes environment on NVIDIA DGX Spark. We will use Minikube to deploy an LLM (Qwen3-4B) served by the vLLM container from NGC (tag 25.12-py3).

Install Minikube

# Download the ARM64 binary
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-arm64
sudo install minikube-linux-arm64 /usr/local/bin/minikube

Install kubectl

# Download the ARM64 binary for kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/arm64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Optional: alias kubectl to Minikube's bundled kubectl (it always matches the cluster version), then reload your shell configuration
echo 'alias kubectl="minikube kubectl --"' >> ~/.bashrc
source ~/.bashrc

Start Minikube using the Docker driver. The --gpus=all flag allows the cluster to access the DGX Spark GPU.

minikube start \
    --driver=docker \
    --container-runtime=docker \
    --gpus=all \
    --cpus=max \
    --memory=max

Before deploying the model, verify that the NVIDIA runtime is working within the cluster.

# Run a temporary GPU test pod
kubectl run gpu-test --image=nvidia/cuda:13.1.0-runtime-ubuntu22.04 --restart=Never -- nvidia-smi

# Check the logs to see the Blackwell (GB10) GPU status
kubectl logs gpu-test
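
The pod needs some time to pull the image and run, so kubectl logs can come back empty if it is called too early. Below is a minimal sketch that waits for the pod to finish, checks the log for the nvidia-smi banner, and removes the pod. The helper name check_gpu_pod and the 180-second timeout are assumptions made for illustration, not part of the original steps.

```shell
# Hypothetical helper (not from the original post): wait for the GPU test pod
# to finish, confirm nvidia-smi printed its banner, then delete the pod.
check_gpu_pod() {
  local pod="${1:-gpu-test}"
  local timeout="${2:-180s}"   # assumed value; raise it for slow image pulls

  # Block until the pod reaches phase Succeeded (needs kubectl >= 1.23).
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
    "pod/${pod}" --timeout="${timeout}" || return 1

  # The nvidia-smi header contains the string "NVIDIA-SMI".
  if kubectl logs "${pod}" | grep -q "NVIDIA-SMI"; then
    echo "GPU visible inside the cluster"
  else
    echo "nvidia-smi output not found in pod logs" >&2
    return 1
  fi

  # Clean up so the test can be re-run under the same pod name.
  kubectl delete pod "${pod}" --wait=false
}

# Usage: check_gpu_pod gpu-test 180s
```

Note that shell aliases are not expanded inside scripts, so the function calls the kubectl binary installed earlier rather than the minikube kubectl alias.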


Create pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-4b-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50G

Create secret.yaml. Replace YOUR_ACTUAL_TOKEN with your Hugging Face Read Token.

apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "YOUR_ACTUAL_TOKEN"

Apply these to the cluster:

kubectl apply -f pvc.yaml
kubectl apply -f secret.yaml
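
If you would rather not keep the token in a file on disk, the same Secret can be created from an environment variable. This is a sketch under the assumption that the token is exported as HF_TOKEN in your shell; the helper name create_hf_secret is made up for illustration. The secret name and key match secret.yaml above.

```shell
# Hypothetical helper: create (or update) hf-token-secret from $HF_TOKEN.
create_hf_secret() {
  # Refuse to create an empty secret.
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  # Render the Secret client-side and pipe it to apply, so the call is
  # repeatable: it updates an existing secret instead of failing.
  kubectl create secret generic hf-token-secret \
    --from-literal=token="${HF_TOKEN}" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Usage: export HF_TOKEN=<your token>; create_hf_secret
```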

Create deployment.yaml to deploy Qwen3-4B using vLLM:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3-4b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3-4b
  template:
    metadata:
      labels:
        app: qwen3-4b
    spec:
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: qwen3-4b-storage
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
      containers:
        - name: qwen3-4b
          image: nvcr.io/nvidia/vllm:25.12-py3
          command: ["/bin/sh", "-c"]
          args: [
            "vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --trust-remote-code --max-model-len 32768 --gpu-memory-utilization 0.45"
          ]
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: cache-volume
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm

Apply the deployment:

kubectl apply -f deployment.yaml

Once the pod is running (the first start may take a few minutes while the model downloads), use port-forwarding to access the API locally. No Service was created above, so forward to the Deployment directly:

kubectl port-forward deployment/qwen3-4b 8000:8000
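
The port-forward can connect before vLLM has finished loading the weights, in which case requests fail until the server is up. A small sketch, run from a second terminal, that polls the /v1/models endpoint until it answers. The helper name wait_for_vllm and the default of 60 tries at 5-second intervals are assumptions; tune them to your download speed.

```shell
# Hypothetical helper: poll the OpenAI-compatible API until it responds.
wait_for_vllm() {
  local base_url="${1:-http://localhost:8000}"
  local max_tries="${2:-60}"   # assumed: 60 tries x 5 s = 5 minutes
  local delay="${3:-5}"
  local i=1
  while [ "$i" -le "$max_tries" ]; do
    # -f makes curl return non-zero on HTTP errors as well as refused connections.
    if curl -fsS -o /dev/null "${base_url}/v1/models" 2>/dev/null; then
      echo "vLLM is ready after ${i} attempt(s)"
      return 0
    fi
    sleep "${delay}"
    i=$((i + 1))
  done
  echo "vLLM did not become ready at ${base_url}" >&2
  return 1
}

# Usage: wait_for_vllm http://localhost:8000 60 5
```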

Test the Inference

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the Grace Blackwell architecture on DGX Spark."}
    ],
    "max_tokens": 100
  }'

Hello everyone again!

I’ve added instructions for running the vLLM production stack on NVIDIA DGX Spark. The deployment consists of a vLLM router and two replica pods, each running Qwen3-4B on vLLM, for high availability.

Install Helm with the following command:

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Clone the vLLM production-stack repository.

git clone https://github.com/vllm-project/production-stack.git
cd production-stack

Add the helm repo

helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

The default vLLM router image on Docker Hub is built for x86, so on the DGX Spark we must build an ARM64 image locally. Run this command from the production-stack root:

docker build -t vllm-router-arm64:latest \
  --build-arg INSTALL_OPTIONAL_DEP=semantic_cache \
  -f docker/Dockerfile .

Once the image builds successfully, load it into the Minikube node so the deployment can use it:

minikube image load vllm-router-arm64:latest
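
Because the values file later sets imagePullPolicy to "Never", a missing or wrong-architecture image only shows up as a pod that never starts. The sketch below checks both conditions up front. The helper name verify_router_image is hypothetical, and the substring match assumes minikube image ls prints names that contain the tag as given (typically prefixed with docker.io/library/).

```shell
# Hypothetical helper: confirm the router image is arm64 and loaded into Minikube.
verify_router_image() {
  local image="${1:-vllm-router-arm64:latest}"
  local arch

  # Architecture recorded in the local Docker image metadata.
  arch="$(docker image inspect --format '{{.Architecture}}' "${image}")" || return 1
  if [ "${arch}" != "arm64" ]; then
    echo "expected arm64, got ${arch}" >&2
    return 1
  fi

  # Fixed-string match against the images present in the Minikube node.
  if minikube image ls | grep -qF "${image}"; then
    echo "${image} (arm64) is loaded into Minikube"
  else
    echo "${image} is not present in the Minikube node" >&2
    return 1
  fi
}

# Usage: verify_router_image vllm-router-arm64:latest
```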

Create a custom values file named values-dgx-spark.yaml with the following content. This overrides the default images with ones that support DGX Spark:

# values-dgx-spark.yaml

# -----------------------------------------------------------------------------
# Serving Engine Configuration (The LLM Backend)
# -----------------------------------------------------------------------------
servingEngineSpec:
  strategy:
    type: Recreate
  runtimeClassName: "" 
  modelSpec:
  - name: "qwen-3-4b"
    repository: "nvcr.io/nvidia/vllm"
    tag: "25.12-py3"
    modelURL: "Qwen/Qwen3-4B"
    replicaCount: 2
    requestCPU: 8
    requestMemory: "2Gi"

    vllmConfig:
      maxModelLen: 8192
      gpuMemoryUtilization: 0.4
      enablePrefixCaching: true
      extraArgs: ["--disable-log-requests"]

# -----------------------------------------------------------------------------
# Router Configuration
# -----------------------------------------------------------------------------

routerSpec:
  enableRouter: true
  
  repository: "vllm-router-arm64"
  tag: "latest"
  imagePullPolicy: "Never" 
  
  # Port mapping
  containerPort: 8000
  
  # Service Discovery
  serviceDiscovery: "k8s" 
  k8sServiceDiscoveryType: "pod-ip"

  # Routing logic
  routingLogic: "roundrobin"


Install the chart with this values file:

helm install spark-vllm vllm/vllm-stack -f values-dgx-spark.yaml

The first run takes some time, because it has to fetch both the model weights and the vLLM NGC Docker image.

Verify the deployment:

kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
spark-vllm-deployment-router-6f977d446d-pnhmk           1/1     Running   0          37m
spark-vllm-qwen-3-4b-deployment-vllm-7499f9c8b8-glc96   1/1     Running   0          37m
spark-vllm-qwen-3-4b-deployment-vllm-7499f9c8b8-h7d78   1/1     Running   0          37m
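
Instead of re-running kubectl get pods, kubectl rollout status can block until each Deployment reports ready. This is a sketch only: the two Deployment names are inferred from the pod names in the listing above, and the helper name wait_for_stack and the 30-minute default timeout are assumptions.

```shell
# Hypothetical helper: wait for the router and engine Deployments to roll out.
wait_for_stack() {
  local timeout="${1:-30m}"   # assumed upper bound for image pull + weight download
  local deploy
  for deploy in \
    spark-vllm-deployment-router \
    spark-vllm-qwen-3-4b-deployment-vllm
  do
    # Returns non-zero if the Deployment is not ready within the timeout.
    kubectl rollout status "deployment/${deploy}" --timeout="${timeout}" || return 1
  done
  echo "all deployments ready"
}

# Usage: wait_for_stack 30m
```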

Expose the port for the service:

kubectl port-forward svc/spark-vllm-router-service 30080:80

Check model availability:

curl -o- http://localhost:30080/v1/models
{"object":"list","data":[{"id":"Qwen/Qwen3-4B","object":"model","created":1767283942,"owned_by":"vllm","root":null,"parent":null}]}

Send a chat completion request. The router distributes incoming requests across the two Qwen3 instances (round-robin):

curl http://localhost:30080/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    "temperature": 0.7
  }'

{"id":"chatcmpl-24057c15171943a5b18ff52e47b558cb","object":"chat.completion","created":1767284288,"model":"Qwen/Qwen3-4B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, the user wants a one-sentence explanation of quantum computing. Let me start by recalling what I know. Quantum computing uses qubits, which can be in superposition, so they can represent 0 and 1 simultaneously. That's different from classical bits. Also, entanglement is a key aspect, where qubits are linked and the state of one affects the other. Then there's quantum interference, which helps in amplifying correct answers and canceling out wrong ones. The goal is to solve certain problems much faster than classical computers, like factoring large numbers or simulating molecules. But I need to keep it concise. Maybe start with \"Quantum computing leverages...\" and mention qubits, superposition, entanglement, and the potential for exponential speedup. But how to fit all that into one sentence without being too technical? Maybe something like: \"Quantum computing uses qubits that exist in superpositions of states and entanglement to perform complex calculations exponentially faster than classical computers for specific problems.\" Wait, does that cover the main points? Maybe mention the purpose, like solving certain problems efficiently. Let me check if that's accurate. Yeah, that seems right. But maybe \"exponential speedup\" is important. Also, mention that it's for specific problems, since not all problems benefit from it. 
Okay, that should work.\n</think>\n\nQuantum computing leverages qubits that exist in superpositions of states and entanglement to perform complex calculations exponentially faster than classical computers for specific problems like factoring large numbers or simulating molecules.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":16,"total_tokens":338,"completion_tokens":322,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
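
The raw response is hard to read because the reasoning block and the answer share one JSON string. A small sketch that prints only the final answer, assuming jq is installed (it is not part of the steps above) and that the model puts the <think> and </think> markers on their own lines, as in the response shown. The helper name extract_answer is made up for illustration.

```shell
# Hypothetical helper: read a chat-completion JSON document on stdin and
# print the assistant message without the <think>...</think> block.
extract_answer() {
  # 1) pull out the message text, 2) delete the reasoning block,
  # 3) delete the blank lines left in front of the answer.
  jq -r '.choices[0].message.content' \
    | sed '/<think>/,/<\/think>/d' \
    | sed '/./,$!d'
}

# Usage (same request as above, with -s to silence the progress meter):
#   curl -s http://localhost:30080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hi"}]}' \
#     | extract_answer
```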

This part shows how to set up the vLLM observability stack on a DGX Spark using Prometheus and Grafana. They collect key metrics such as TTFT (Time-To-First-Token), ITL (Inter-Token Latency), and throughput, providing real-time insight into your model’s performance.

From the production-stack root, change into the observability directory and run the installation script. The script deploys the kube-prometheus-stack, which includes Prometheus, Grafana, and the necessary operators.

cd observability
bash install.sh

Output

NAME: kube-prom-stack
LAST DEPLOYED: Thu Jan  1 23:27:27 2026
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
  kubectl --namespace monitoring get pods -l "release=kube-prom-stack"

Get Grafana 'admin' user password by running:

  kubectl --namespace monitoring get secrets kube-prom-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo

Access Grafana local instance:

  export POD_NAME=$(kubectl --namespace monitoring get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prom-stack" -oname)
  kubectl --namespace monitoring port-forward $POD_NAME 3000

Get your grafana admin user password by running:

  kubectl get secret --namespace monitoring -l app.kubernetes.io/component=admin-secret -o jsonpath="{.items[0].data.admin-password}" | base64 --decode ; echo


Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
Release "prometheus-adapter" does not exist. Installing it now.
NAME: prometheus-adapter
LAST DEPLOYED: Thu Jan  1 23:28:42 2026
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
prometheus-adapter has been deployed.
In a few minutes you should be able to list metrics using the following command(s):

  kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1

Run this command to map the Grafana service to the DGX’s local port 3000

kubectl port-forward -n monitoring svc/kube-prom-stack-grafana 3000:80 --address 0.0.0.0

Establish an SSH Tunnel from your Remote PC

ssh -L 3000:localhost:3000 spark@<DGX-IP-ADDRESS>

Open your browser on your Remote PC. Navigate to: http://127.0.0.1:3000

  • User: admin
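  • Password: likely prom-operator (the chart default); if that fails, read it from the kube-prom-stack-grafana secret with the command printed in the install output above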

Import the vLLM Dashboard

Open vLLM dashboard

To verify the observability stack is working, run a benchmark to generate traffic. This test sends 1,000 requests with a concurrency of 100 to the Qwen3 model.

vllm bench serve \
    --backend vllm \
    --base-url http://localhost:30080 \
    --model Qwen/Qwen3-4B \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 128 \
    --num-prompts 1000 \
    --max-concurrency 100 \
    --temperature 0.7

Output of bench:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             100       
Benchmark duration (s):                  175.69    
Total input tokens:                      128000    
Total generated tokens:                  128000    
Request throughput (req/s):              5.69      
Output token throughput (tok/s):         728.54    
Peak output token throughput (tok/s):    850.00    
Peak concurrent requests:                195.00    
Total token throughput (tok/s):          1457.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          814.82    
Median TTFT (ms):                        750.91    
P99 TTFT (ms):                           1769.83   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          131.70    
Median TPOT (ms):                        131.66    
P99 TPOT (ms):                           135.67    
---------------Inter-token Latency----------------
Mean ITL (ms):                           131.70    
Median ITL (ms):                         126.25    
P99 ITL (ms):                            421.22    
==================================================
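
As a sanity check, the throughput lines can be recomputed from the counts and the benchmark duration. The helper name rate is made up for illustration; it only divides a count by a duration in seconds.

```shell
# rate COUNT SECONDS: print COUNT / SECONDS rounded to two decimals.
rate() {
  awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", n / t }'
}

rate 1000 175.69     # request throughput (req/s)
rate 128000 175.69   # output token throughput (tok/s)
rate 256000 175.69   # total token throughput (tok/s), input + output
```

This prints 5.69, 728.56, and 1457.11. The small differences from the reported 728.54 and 1457.08 are consistent with the duration being rounded to two decimals in the report.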


If your dashboard shows “No Data” for GPU Cache Usage or GPU KV Cache Hit Rate, the likely cause is the nvidia-device-plugin crashing. The device plugin does not appear to support DGX Spark officially; see NVIDIA/k8s-device-plugin issue #1482, “does device plugin support GB10 (NVIDIA DGX Spark)”, on GitHub.
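
One way to narrow this down is to check which metrics the engine itself exports before looking at Prometheus or Grafana. This is a sketch under two assumptions: an engine pod is reachable on localhost:8000 (for example through kubectl port-forward deployment/spark-vllm-qwen-3-4b-deployment-vllm 8000:8000, where the Deployment name is inferred from the pod names above), and vLLM serves Prometheus metrics at /metrics with a vllm: prefix. The helper name vllm_metric_names is hypothetical.

```shell
# Hypothetical helper: list the distinct vllm:* metric names an engine exposes.
vllm_metric_names() {
  local base_url="${1:-http://localhost:8000}"
  # Keep only the metric name at the start of each sample line
  # (labels in {...} and values are dropped), then de-duplicate.
  curl -fsS "${base_url}/metrics" \
    | grep -o '^vllm:[A-Za-z0-9_:]*' \
    | sort -u
}

# Usage: vllm_metric_names http://localhost:8000 | grep -i cache
```

If no cache-related name appears in the list, the dashboard panel may be querying a metric that this vLLM build does not export, which would also produce “No Data”.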

Thanks for sharing @shahizat! This is great