Hello all, this guide outlines the steps to set up a K3s cluster on an NVIDIA DGX Spark, using Docker as the container runtime with NVIDIA GPU support, to serve the Qwen3-4B model with vLLM.
Before installing K3s, make sure Docker is configured to use the NVIDIA Container Runtime, so that your Kubernetes pods can access the GPU hardware on the DGX Spark.
Edit /etc/docker/daemon.json and add the following configuration:
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "args": []
    }
  },
  "default-runtime": "nvidia"
}
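If daemon.json already exists with other settings, merging the snippet above by hand is easy to get wrong. As a quick sanity check, a small script along these lines (the helper name is just for illustration) can confirm the file declares the NVIDIA runtime and makes it the default:

```python
import json

def nvidia_runtime_configured(daemon_json_text: str) -> bool:
    """Return True if the Docker daemon config declares the NVIDIA
    runtime and sets it as the default."""
    cfg = json.loads(daemon_json_text)
    runtimes = cfg.get("runtimes", {})
    return (
        "nvidia" in runtimes
        and runtimes["nvidia"].get("path") == "nvidia-container-runtime"
        and cfg.get("default-runtime") == "nvidia"
    )

# Check the config shown above (on a real system, read /etc/docker/daemon.json).
sample = """
{
  "runtimes": {
    "nvidia": {"path": "nvidia-container-runtime", "args": []}
  },
  "default-runtime": "nvidia"
}
"""
print(nvidia_runtime_configured(sample))  # True
```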
Restart the Docker service to apply the new runtime settings:
sudo systemctl restart docker
By default, K3s uses containerd. Since we have configured Docker with the NVIDIA runtime, we must explicitly tell K3s to use Docker.
Standard Installation:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker --write-kubeconfig-mode 644 --disable=traefik" sh -
Installation with Custom DNS:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker --write-kubeconfig-mode 644 --disable=traefik --resolv-conf /etc/k3s-dns.conf" sh -
Grant your user permission to access the cluster configuration and set the environment variable.
sudo chmod 644 /etc/rancher/k3s/k3s.yaml
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
echo "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml" >> ~/.bashrc
Check that your node is ready:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
gx10-868a Ready control-plane,master 3m53s v1.33.6+k3s1
Then, create a PersistentVolumeClaim for the model cache:
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-4b-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50G
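Note that Kubernetes distinguishes decimal suffixes (G, 10^9 bytes) from binary ones (Gi, 2^30 bytes), so the 50G requested above is slightly less than 50Gi. A quick illustration of the difference:

```python
# Kubernetes resource quantities: "G" is decimal (10^9), "Gi" is binary (2^30).
decimal_50g = 50 * 10**9   # what "storage: 50G" requests
binary_50gi = 50 * 2**30   # what "storage: 50Gi" would request

print(decimal_50g)                # 50000000000
print(binary_50gi)                # 53687091200
print(binary_50gi - decimal_50g)  # 3687091200 bytes, roughly 3.4 GiB
```

Either spelling works; just be aware of which one you are asking for when sizing the volume.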
Create a Secret to store your Hugging Face token for model access:
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "YOUR_ACTUAL_TOKEN"
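Using stringData lets you supply the token as plaintext; the API server base64-encodes it into the data field for you. Keep in mind that base64 is an encoding, not encryption, as this small round-trip (placeholder token, as in secret.yaml) shows:

```python
import base64

# Kubernetes stores Secret values base64-encoded under `data`;
# `stringData` just saves you from encoding them yourself.
token = "YOUR_ACTUAL_TOKEN"  # placeholder, as in secret.yaml
encoded = base64.b64encode(token.encode()).decode()
decoded = base64.b64decode(encoded).decode()

print(encoded)           # the value you would see under data.token
assert decoded == token  # base64 is reversible: encoding, not encryption
```

Anyone who can read the Secret can recover the token, so restrict access with RBAC rather than relying on the encoding.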
Create the Deployment and Service YAML file:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3-4b
  labels:
    app: qwen3-4b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3-4b
  template:
    metadata:
      labels:
        app: qwen3-4b
    spec:
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: qwen3-4b-storage
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
      containers:
        - name: qwen3-4b
          image: nvcr.io/nvidia/vllm:25.12-py3
          command: ["/bin/sh", "-c"]
          args:
            - "vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --trust-remote-code --max-model-len 32768 --gpu-memory-utilization 0.45"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: "8"
              memory: 16Gi
              nvidia.com/gpu: 1 # Ensures the pod is scheduled on a GPU node
            requests:
              cpu: "4"
              memory: 8Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: cache-volume
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: qwen3-4b
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  selector:
    app: qwen3-4b
Apply these files:
kubectl apply -f pvc.yaml
kubectl apply -f secret.yaml
kubectl apply -f deployment.yaml
You can run nvidia-smi to confirm the GPU is in use (output omitted here).
Check the pod status and logs with the commands below:
kubectl get pods
NAME READY STATUS RESTARTS AGE
qwen3-4b-7fd7d4485d-j42fn 1/1 Running 0 38m
Then run:
kubectl logs qwen3-4b-7fd7d4485d-j42fn
Once the pod status is Running, find the Service IP:
kubectl get svc qwen3-4b
Replace the IP address below with the CLUSTER-IP retrieved from the command above:
curl http://<SERVICE_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-4B",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the Grace Blackwell architecture on DGX Spark."}
],
"max_tokens": 100
}'
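The same request can be sent from Python. This is a minimal stdlib-only sketch (the build_chat_request helper is just for illustration); it assumes the service IP is reachable from wherever you run it, so substitute the CLUSTER-IP as with the curl example:

```python
import json
import urllib.request

def build_chat_request(service_ip: str) -> urllib.request.Request:
    """Build the same chat-completions request as the curl example above."""
    payload = {
        "model": "Qwen/Qwen3-4B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain the Grace Blackwell architecture on DGX Spark."},
        ],
        "max_tokens": 100,
    }
    return urllib.request.Request(
        f"http://{service_ip}:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),          # POST body
        headers={"Content-Type": "application/json"},
    )

# Sending it only works with the service running and reachable:
# with urllib.request.urlopen(build_chat_request("<SERVICE_IP>")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Since vLLM exposes an OpenAI-compatible API, any OpenAI-style client pointed at http://SERVICE_IP:8000/v1 should also work.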
Hope it helps.
