Local Kubernetes Cluster with K3s on Nvidia DGX Spark

Hello all, this guide outlines the steps to set up a K3s cluster using Docker as the container runtime with NVIDIA GPU support on Nvidia DGX Spark to serve the Qwen3-4B model using vLLM.

Before installing K3s, you must ensure Docker is configured to use the NVIDIA Container Runtime. This allows your Kubernetes pods to access the GPU hardware on Nvidia DGX Spark.

Edit /etc/docker/daemon.json and add the following configuration:

{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "default-runtime": "nvidia"
}

Restart the Docker service to apply the new runtime settings:

sudo systemctl restart docker
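
To confirm the change took effect, you can check Docker's reported default runtime and run a quick GPU smoke test (the CUDA image tag here is only an example; any CUDA base image works):

```shell
# Docker should now report "nvidia" as its default runtime
docker info --format '{{.DefaultRuntime}}'

# Smoke-test GPU access from a container; nvidia-smi should list the GPU
docker run --rm nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```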

By default, K3s uses containerd. Since we have configured Docker with the NVIDIA runtime, we must explicitly tell K3s to use Docker.

Standard Installation:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker --write-kubeconfig-mode 644 --disable=traefik" sh -

Installation with Custom DNS:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker --write-kubeconfig-mode 644 --disable=traefik --resolv-conf /etc/k3s-dns.conf" sh -
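
The --resolv-conf flag points K3s at a resolv.conf-style file of your choosing. A minimal example (the nameserver addresses below are placeholders; substitute your own resolvers):

```
# /etc/k3s-dns.conf
nameserver 8.8.8.8
nameserver 1.1.1.1
```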

Grant your user permission to access the cluster configuration and set the environment variable.

sudo chmod 644 /etc/rancher/k3s/k3s.yaml
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
echo "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml" >> ~/.bashrc

Check if your node is ready:

kubectl get nodes
NAME        STATUS   ROLES                  AGE     VERSION
gx10-868a   Ready    control-plane,master   3m53s   v1.33.6+k3s1
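
Before deploying GPU workloads, it is also worth checking that the node advertises the nvidia.com/gpu resource that the Deployment below requests; if nothing is returned, the NVIDIA device plugin may still need to be installed:

```shell
# Show the allocatable GPU count on the node
# (dots in the resource name must be escaped inside jsonpath)
kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'
```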

Then, create a Persistent Volume Claim:

# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-4b-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50G
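
No storageClassName is set because K3s ships with the local-path provisioner as its default StorageClass, so this claim binds without extra setup. You can verify with:

```shell
kubectl get storageclass
kubectl get pvc qwen3-4b-storage
```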

Create a Secret to store your Hugging Face token for model access:

# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "YOUR_ACTUAL_TOKEN"

Create the Deployment and Service YAML files:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3-4b
  labels:
    app: qwen3-4b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3-4b
  template:
    metadata:
      labels:
        app: qwen3-4b
    spec:
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: qwen3-4b-storage
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
      containers:
        - name: qwen3-4b
          image: nvcr.io/nvidia/vllm:25.12-py3
          command: ["/bin/sh", "-c"]
          args: [
            "vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --trust-remote-code --max-model-len 32768 --gpu-memory-utilization 0.45"
          ]
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: "8"
              memory: 16Gi
              nvidia.com/gpu: 1  # Ensures the pod is scheduled on a GPU node
            requests:
              cpu: "4"
              memory: 8Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: cache-volume
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: qwen3-4b
spec:
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  selector:
    app: qwen3-4b
  type: ClusterIP

Apply these files:

kubectl apply -f pvc.yaml
kubectl apply -f secret.yaml
kubectl apply -f deployment.yaml
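
You can then watch the rollout; the first start can take a while because the model weights are downloaded into the PVC:

```shell
kubectl rollout status deployment/qwen3-4b
```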

On the host, nvidia-smi should show the vLLM process using GPU memory (output omitted here).

Check the pod status and logs with the commands below:

kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
qwen3-4b-7fd7d4485d-j42fn   1/1     Running   0          38m

Then run:

kubectl logs qwen3-4b-7fd7d4485d-j42fn

Once the pod status is Running, find the Service IP:

kubectl get svc qwen3-4b
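
If you would rather not use the ClusterIP directly, kubectl port-forward exposes the Service on localhost:

```shell
# Forwards local port 8000 to the qwen3-4b Service; then target http://127.0.0.1:8000
kubectl port-forward svc/qwen3-4b 8000:8000
```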

Replace <SERVICE_IP> below with the CLUSTER-IP retrieved from the command above:

curl http://<SERVICE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the Grace Blackwell architecture on DGX Spark."}
    ],
    "max_tokens": 100
  }'
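
As a quicker sanity check, vLLM's OpenAI-compatible server also exposes a model listing endpoint:

```shell
# Should return a JSON list containing Qwen/Qwen3-4B
curl http://<SERVICE_IP>:8000/v1/models
```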

Hope it helps.


Thanks for sharing. 👍

For anyone interested, you can also use Talos Linux with the 1.12 release to run Kubernetes on the Spark. It removes DGX Linux and replaces it with a newer kernel (6.18) and optimized environment for Kubernetes.

The guide for setting up the NVIDIA drivers and container runtime can be found here: NVIDIA GPU (OSS drivers) - Sidero Documentation


Thanks for sharing! Any luck getting the network-operator working with 2 DGX sparks?