Hello all,
In this guide, I provide instructions for setting up a local Kubernetes environment on NVIDIA DGX Spark. We will use Minikube to deploy LLMs such as Qwen3-4B, served with the vLLM container from NGC (25.12-py3).
Install Minikube
# Download the ARM64 binary
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-arm64
sudo install minikube-linux-arm64 /usr/local/bin/minikube
Install kubectl
# Download the ARM64 binary for kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/arm64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# Optional: alias kubectl to Minikube's bundled kubectl (it matches the cluster version) and reload your shell configuration
echo 'alias kubectl="minikube kubectl --"' >> ~/.bashrc
source ~/.bashrc
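Before running `sudo install` on a downloaded binary, you may want to verify its checksum. The sketch below assumes you also downloaded the matching `kubectl.sha256` file (the same download URL with `.sha256` appended); `verify_sha256` is a helper name I made up for illustration, not part of kubectl.

```shell
# Hypothetical helper: verify FILE against the hex digest stored in SUMFILE.
# Returns 0 when the digest matches, non-zero otherwise.
verify_sha256() {
  file="$1"
  sumfile="$2"
  # sha256sum --check expects lines of the form "<digest>  <filename>"
  echo "$(cat "$sumfile")  $file" | sha256sum --check --status
}
```

Usage: `verify_sha256 kubectl kubectl.sha256 && echo "kubectl checksum OK"`.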
Start Minikube using the Docker driver. The --gpus=all flag allows the cluster to access the DGX Spark GPU.
minikube start \
  --driver=docker \
  --container-runtime=docker \
  --gpus=all \
  --cpus=max \
  --memory=max
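Once the cluster is up, a quick way to confirm that the GPU was passed through is to check whether the node advertises the `nvidia.com/gpu` resource. This is only a sketch: `has_gpu` and `check_cluster_gpu` are hypothetical helper names, and I am assuming a single-node Minikube cluster, so only the first node is queried. The resource can take a short while to appear after startup.

```shell
# Hypothetical helper: succeeds when COUNT is a number >= 1.
has_gpu() {
  count="$1"
  case "$count" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: resource not advertised
  esac
  [ "$count" -ge 1 ]
}

# Hypothetical wrapper: read the allocatable GPU count from the first node.
check_cluster_gpu() {
  count="$(kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}')"
  if has_gpu "$count"; then
    echo "GPU resource visible: nvidia.com/gpu=$count"
  else
    echo "No nvidia.com/gpu resource on the node yet" >&2
    return 1
  fi
}
```

Usage: run `check_cluster_gpu` after `minikube start` returns.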
Before deploying the model, verify that the NVIDIA runtime is working within the cluster.
# Run a temporary GPU test pod
kubectl run gpu-test --image=nvidia/cuda:13.1.0-runtime-ubuntu22.04 --restart=Never -- nvidia-smi
# Check the logs to see the Blackwell (GB10) GPU status
kubectl logs gpu-test
The log should show the nvidia-smi table listing the GB10 GPU.
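If `kubectl logs` comes back empty, the pod may still be pulling the image. A small sketch that waits for the test pod to finish, prints its log, and then removes it (the helper name `gpu_test_report` and the 120-second timeout are my own choices, adjust as needed):

```shell
# Hypothetical helper: wait for gpu-test to complete, show its log, clean up.
gpu_test_report() {
  # Block until the pod reaches phase Succeeded, or give up after the timeout
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=120s || return 1
  kubectl logs gpu-test
  # Remove the finished pod so the name can be reused for another test run
  kubectl delete pod gpu-test
}
```

Usage: run `gpu_test_report` right after the `kubectl run gpu-test ...` command.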
Create pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-4b-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50G
Create secret.yaml. Replace YOUR_ACTUAL_TOKEN with your Hugging Face Read Token.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "YOUR_ACTUAL_TOKEN"
Apply these to the cluster:
kubectl apply -f pvc.yaml
kubectl apply -f secret.yaml
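If you would rather not keep the token in a file on disk, the same Secret can be created from an environment variable. This is a sketch of that alternative, assuming the token is exported as `HF_TOKEN` in your shell; `create_hf_secret` is a hypothetical helper name. The `--dry-run=client -o yaml | kubectl apply -f -` pattern makes it safe to re-run.

```shell
# Hypothetical helper: create or update hf-token-secret from $HF_TOKEN.
create_hf_secret() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  # Render the Secret locally, then apply it so repeated runs update in place
  kubectl create secret generic hf-token-secret \
    --from-literal=token="$HF_TOKEN" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Usage: `export HF_TOKEN=...` and then `create_hf_secret`, in place of `kubectl apply -f secret.yaml`.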
Create deployment.yaml to deploy Qwen3-4B using vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3-4b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3-4b
  template:
    metadata:
      labels:
        app: qwen3-4b
    spec:
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: qwen3-4b-storage
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
      containers:
        - name: qwen3-4b
          image: nvcr.io/nvidia/vllm:25.12-py3
          command: ["/bin/sh", "-c"]
          args: [
            "vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --trust-remote-code --max-model-len 32768 --gpu-memory-utilization 0.45"
          ]
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: cache-volume
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
Apply the deployment:
kubectl apply -f deployment.yaml
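To follow the rollout instead of polling `kubectl get pods` by hand, something like the sketch below can help. `watch_rollout` is a hypothetical helper name and the 15-minute default timeout is a guess on my part; the image pull and model download can take longer on a slow connection.

```shell
# Hypothetical helper: wait for the Deployment to become ready.
# Optional first argument: timeout (default 15m).
watch_rollout() {
  kubectl rollout status deployment/qwen3-4b --timeout="${1:-15m}" || {
    # Rollout did not finish in time: show pod state and recent logs for debugging
    kubectl get pods -l app=qwen3-4b
    kubectl logs deployment/qwen3-4b --tail=50
    return 1
  }
}
```

Usage: `watch_rollout` or, with a custom timeout, `watch_rollout 30m`.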
Once the pod is running (it may take a few minutes to download the model), use port-forwarding to access the API locally. The manifest above defines only a Deployment, not a Service, so forward to the Deployment:
kubectl port-forward deployment/qwen3-4b 8000:8000
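If you would rather forward through a Service named qwen3-4b, one has to exist first, and deployment.yaml does not define one. A minimal sketch that creates a ClusterIP Service with `kubectl expose` (the helper name `expose_qwen` is hypothetical, and the port numbers simply mirror the containerPort from the manifest):

```shell
# Hypothetical helper: create a Service for the qwen3-4b Deployment if missing.
expose_qwen() {
  # Skip creation when the Service is already there so the helper can be re-run
  if kubectl get service qwen3-4b >/dev/null 2>&1; then
    echo "service/qwen3-4b already exists"
  else
    kubectl expose deployment qwen3-4b --port=8000 --target-port=8000
  fi
}
```

After `expose_qwen`, `kubectl port-forward svc/qwen3-4b 8000:8000` should work as well.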
Test the Inference
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the Grace Blackwell architecture on DGX Spark."}
    ],
    "max_tokens": 100
  }'
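The request above is refused until vLLM has finished loading the model. A sketch that polls the server's `/health` endpoint through the port-forward before you send the first request; `wait_for_vllm` is a hypothetical helper name, and the retry count and delay are arbitrary defaults of mine:

```shell
# Hypothetical helper: poll URL until it answers with a 2xx status.
# Arguments (all optional): URL, number of attempts, delay in seconds.
wait_for_vllm() {
  url="${1:-http://localhost:8000/health}"
  tries="${2:-60}"
  delay="${3:-5}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl return non-zero on HTTP errors, -s keeps it quiet
    if curl -sf -o /dev/null "$url"; then
      echo "vLLM is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "vLLM did not become ready after $tries attempts" >&2
  return 1
}
```

Usage: keep the port-forward running in another terminal, run `wait_for_vllm`, then send the chat completion request.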



