I am EXTREMELY disappointed with the current state of DGX Spark

Feel free to dump this into a gist or something, but I have to leave for an appointment. Here’s what I documented:

# ComfyUI Distributed — Dual-Node Setup

Short tutorial for adding a worker node to a ComfyUI master using the
[ComfyUI-Distributed](https://github.com/robertvoy/ComfyUI-Distributed) plugin.

Throughout this guide, replace the placeholders with your own values:

- `<user>` — the Linux username on both nodes (assumed to be the same)
- `<IP address node 1>` — routable IP of the master node
- `<IP address node 2>` — routable IP of the worker node
- `<worker-id>` — a short slug for the worker, e.g. `worker1`
- `<Worker Name>` — a human-readable label shown in the UI

## Topology

| Role   | Host                              | Port | Notes                          |
| ------ | --------------------------------- | ---- | ------------------------------ |
| Master | node 1 — `<IP address node 1>`    | 8188 | UI lives here, queues the work |
| Worker | node 2 — `<IP address node 2>`    | 8188 | Headless, returns results      |

Both nodes run an identical ComfyUI install. The master's plugin talks to the
worker over HTTP and pulls image bytes back when the worker finishes.

## 1. Prereqs (both nodes)

Install ComfyUI the same way in the same path on each node — this matters:
the master references workflows and model filenames by string, so anything
referenced must resolve identically on the worker.

```bash
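# Run on BOTH nodes
# Create and activate a dedicated virtual environment
python3 -m venv ~/comfyui-env
source ~/comfyui-env/bin/activate
# Clone ComfyUI into the same path on each node and install its dependencies
git clone https://github.com/comfyanonymous/ComfyUI.git ~/ComfyUI
cd ~/ComfyUI
pip install -r requirements.txt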
```

Keep `~/ComfyUI/models/` in sync between the two machines. rsync is the
simplest way:

```bash
# From node 1 → node 2
rsync -avh --progress ~/ComfyUI/models/ <user>@<IP address node 2>:~/ComfyUI/models/
```
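One way to confirm that the two `models/` trees match after an rsync is to compare file listings. The sketch below is only an illustration, not part of ComfyUI or the plugin: `list_models` and `diff_listings` are hypothetical helper names, and it assumes that comparing relative paths and file sizes is a sufficient check (it does not hash file contents).

```python
from pathlib import Path


def list_models(root):
    """Return {relative_path: size_in_bytes} for every file under root."""
    root = Path(root).expanduser()
    return {
        path.relative_to(root).as_posix(): path.stat().st_size
        for path in root.rglob("*")
        if path.is_file()
    }


def diff_listings(master, worker):
    """Compare two listings produced by list_models.

    Returns (missing_on_worker, size_mismatch), each a sorted list of
    relative paths.
    """
    missing = sorted(set(master) - set(worker))
    mismatched = sorted(
        name for name in set(master) & set(worker) if master[name] != worker[name]
    )
    return missing, mismatched
```

A possible workflow is to call `list_models("~/ComfyUI/models")` on each node, save the result as JSON, copy one file across, and pass both dictionaries to `diff_listings`. Anything it reports as missing or mismatched is a candidate for another rsync.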

## 2. Install the Distributed plugin (both nodes)

```bash
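# Run on BOTH nodes, inside the same virtual environment as ComfyUI
source ~/comfyui-env/bin/activate
cd ~/ComfyUI/custom_nodes
git clone https://github.com/robertvoy/ComfyUI-Distributed.git
cd ComfyUI-Distributed
# Install the plugin's dependencies only if it ships a requirements file
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi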
```

## 3. Launch ComfyUI on both nodes

Both nodes need `--listen 0.0.0.0`: the master must reach the worker to submit
jobs, and the worker must reach the master to upload results.
`--enable-cors-header` keeps the browser from blocking the master's web UI
when it streams worker previews.

```bash
# Run on BOTH nodes
source ~/comfyui-env/bin/activate
cd ~/ComfyUI
python main.py --listen 0.0.0.0 --port 8188 --enable-cors-header
```

For long-running setups, run each instance in the background with
`nohup ... &` or as a systemd unit.

## 4. Register the worker on the master

Edit `~/ComfyUI/custom_nodes/ComfyUI-Distributed/gpu_config.json` **on the
master only**:

```json
{
  "master": {
    "host": "<IP address node 1>",
    "cuda_device": 0
  },
  "workers": [
    {
      "id": "<worker-id>",
      "name": "<Worker Name>",
      "type": "remote",
      "host": "<IP address node 2>",
      "port": 8188,
      "enabled": true
    }
  ],
  "settings": {
    "auto_launch_workers": false,
    "stop_workers_on_master_exit": false,
    "master_delegate_only": false
  }
}
```

Key fields:
- `master.host`: a routable IP of the master (not `localhost`). Workers must
  be able to reach it to upload result images back.
- `workers[].host` / `port`: where the master will POST jobs. `host` must be
  an address the worker's ComfyUI is listening on, and `port` must match its
  `--port`.
- `settings.master_delegate_only`: set to `true` if node 1 should only
  coordinate and never run a job itself. Leave it `false` to use both GPUs.

Restart the master's ComfyUI so the plugin re-reads the config.
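Before restarting, it can help to sanity-check the file against the rules listed under "Key fields". The following is a minimal sketch, not something the plugin provides: `check_config` and `check_config_file` are hypothetical names, and the checks cover only the fields shown in the example config above.

```python
import json

# Addresses that only make sense on the machine itself, so the other
# node cannot use them to connect back.
NON_ROUTABLE = {"", "localhost", "127.0.0.1", "::1", "0.0.0.0"}


def check_config(config):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    master_host = config.get("master", {}).get("host", "")
    if master_host in NON_ROUTABLE:
        problems.append(f"master.host {master_host!r} is not reachable from a worker")

    for index, worker in enumerate(config.get("workers", [])):
        label = worker.get("id") or f"workers[{index}]"
        if worker.get("host", "") in NON_ROUTABLE:
            problems.append(f"{label}: host is missing or not routable")
        port = worker.get("port")
        if not isinstance(port, int) or not 1 <= port <= 65535:
            problems.append(f"{label}: port {port!r} is not a valid TCP port")

    return problems


def check_config_file(path):
    """Load a gpu_config.json file and run check_config on it."""
    with open(path, encoding="utf-8") as handle:
        return check_config(json.load(handle))
```

Calling `check_config_file` with the full path to `gpu_config.json` on the master returns an empty list when none of these specific mistakes are present. It does not prove that the nodes can actually reach each other.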

## 5. Use it in a workflow

Inside the ComfyUI UI on the master, open the **Distributed** side panel
(added by the plugin). Your worker should appear as `<Worker Name>` with a
green status dot. If it's red, check that the worker URL is reachable:
`curl http://<IP address node 2>:8188/system_stats`.
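The same reachability check can be scripted, for example to poll several workers. This is a sketch using only the Python standard library. `stats_url` and `worker_is_up` are hypothetical helper names, and it assumes that `/system_stats` answers with HTTP 200 and a JSON body when the worker is healthy.

```python
import json
import urllib.error
import urllib.request


def stats_url(host, port=8188):
    """Build the URL of a ComfyUI instance's /system_stats endpoint."""
    return f"http://{host}:{port}/system_stats"


def worker_is_up(host, port=8188, timeout=5.0):
    """Return True if the host answers /system_stats with HTTP 200 and valid JSON."""
    try:
        with urllib.request.urlopen(stats_url(host, port), timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or a non-JSON body
        return False
```

Run it on the master with the worker's IP to mirror the `curl` check. Running it on the worker with the master's IP tests the reverse direction, which the worker needs for uploading results.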

Then in the workflow itself, add two nodes:

1. Connect a **Distributed Seed** node to the `KSampler`'s seed input, in
   place of the sampler's own seed value.
2. Insert a **Distributed Collector** between the VAE decode and the
   `SaveImage` / `PreviewImage` node.

When you hit **Queue**, the Distributed Seed hands each enabled worker a
different seed, each machine renders its own image in parallel, and the
Collector gathers them all back on the master. You'll see N images come out
of one queue run, where N = 1 master + enabled workers.
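The count N described above can be estimated from `gpu_config.json` before queuing. This is a rough sketch with a hypothetical helper name, `expected_image_count`. It assumes that setting `master_delegate_only` to `true` removes the master from the count, which follows from the description of that setting but has not been verified against the plugin.

```python
def expected_image_count(config):
    """Estimate how many images one queue run should produce.

    Counts every worker whose "enabled" flag is true, plus one for the
    master unless settings.master_delegate_only is true (an assumption).
    """
    enabled_workers = sum(1 for w in config.get("workers", []) if w.get("enabled"))
    delegate_only = config.get("settings", {}).get("master_delegate_only", False)
    return enabled_workers + (0 if delegate_only else 1)
```

If a run returns fewer images than this estimate, a worker was probably unreachable or disabled at queue time.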

The `distributed-txt2img.json` and `cyberrealistic-full-distributed.json`
files in this folder are ready-to-load examples of that wiring.

## 6. Driving it from the API

For scripted batches, skip `/prompt` and POST to the plugin's endpoint:

```bash
curl -X POST http://<IP address node 1>:8188/distributed/queue \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": <workflow JSON>,
        "enabled_worker_ids": ["<worker-id>"]
      }'
```

Results from both the master and the worker appear in the master's `/history`
endpoint and output folder, as if they had all been produced locally.
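For batches, the same request can be sent from Python. The sketch below mirrors the `curl` call using only the standard library. `build_request` and `queue_distributed` are hypothetical names, the payload contains only the two fields shown above, and it assumes the endpoint replies with a JSON body, which this guide does not confirm.

```python
import json
import urllib.request


def build_request(master_host, workflow, worker_ids, port=8188):
    """Build the POST request for the plugin's /distributed/queue endpoint."""
    body = json.dumps(
        {"prompt": workflow, "enabled_worker_ids": list(worker_ids)}
    ).encode("utf-8")
    return urllib.request.Request(
        f"http://{master_host}:{port}/distributed/queue",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_distributed(master_host, workflow, worker_ids, port=8188, timeout=30.0):
    """Send the workflow to the master and return the decoded reply.

    Assumes the reply is JSON; the exact response shape is not documented here.
    """
    request = build_request(master_host, workflow, worker_ids, port)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Here `workflow` is the workflow as a Python dictionary, for example loaded with `json.load` from a file exported from ComfyUI in API format. Keeping `build_request` separate makes it possible to inspect the payload without touching the network.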

## Troubleshooting

- **Worker shows red in UI** — the plugin can't reach the worker host/port.
  `curl` the worker's `/system_stats` from the master. If that works but the
  plugin still fails, restart the master ComfyUI so it re-reads
  `gpu_config.json`.
- **Worker runs job but master shows "missing image"** — the worker couldn't
  reach `master.host` to upload its result. Set `master.host` to an IP the
  worker can actually route to, not `localhost` or `127.0.0.1`.
- **"model not found" on worker only** — the master referenced a checkpoint
  that only exists on node 1. Rsync `models/` again.
- **Only node 1 ever runs jobs** — you're hitting `/prompt` instead of
  `/distributed/queue`, or the workflow is missing the Distributed Seed /
  Collector nodes.