I’ve been refining a local AI agent architecture designed for real-world software engineering. After some trial and error with Docker isolation and memory bottlenecks, I wanted to share the specific hybrid setup that’s actually proving usable.
The Architecture: Decoupling Reasoning from Retrieval
The core philosophy here is separating the “Cognition” from the “Memory” to solve the latency issues caused by loading massive context into a large model every time it runs.
-
The AI Gateway (Docker): This container runs the heavy reasoning engine— llama-proxy.py + llama.cpp serving the Qwen 3.5 35B-A3B model.
-
The AI Agent (OpenClaw) & Orchestrator (Docker): A separate containerized environment for the workspace, project files, and the Orchestrator logic.
-
The “Second Brain” (Host-Side): I run a secondary embedding model (like nomic-embed-text) via Ollama on the host, integrated with the QMD search engine.
By using the host-side embedded model for data retrieval, I avoid forcing the 35B Qwen model to ingest all memory data for every request, which significantly improves responsiveness.
Claw-Hybrid-Platform/
├── ai-agent/ # Logic Layer: OpenClaw
│ ├── ai-agent-logs/ # Persistent runtime logs
│ │ └── gateway.log # Logs for OpenClaw gateway activity
│ └── Dockerfile # Environment for agent logic
├── ai-gateway/ # Computing Layer: LLM & Reasoning Proxy
│ ├── logs/
│ │ └── llama-server.log # Raw output from llama-server
│ ├── proxy/
│ │ ├── config.yaml # Gateway/Proxy configurations
│ │ └── llama-proxy.py # OpenAI-compatible API wrapper for llama.cpp
│ └── Dockerfile # Environment for agent gateway
├── llama.cpp/ # Source: Cloned llama.cpp for local build
├── models/ # Model Storage: Save .gguf files here
├── persistfolder/ # Persistent Data(The “Soul” of the Agent)
│ ├── config/ # Holds openclaw.json and global settings
│ ├── memory/ # Memory storage (main.sqlite)
│ └── workspace/ # Projects, AGENT.md, MEMORY.md, daily log…etc
└── docker-compose.yml # Bridges Host GPU & Docker Containers
The 40GB Memory Challenge
Even on high-end hardware like the NVIDIA DGX Spark (120GB), memory management is the primary hurdle.
-
Baseline: The system and browser sit at ~5GB.
-
The Agent Stack: Spin up OpenClaw, an additional ~25GB, which hit 30GB.
-
Full Orchestration: Once the Orchestrator, Redis, and Celery workers are live, the stack hits 32~40GB before even reaching full load.
This high memory footprint is exactly why the hybrid Docker/Host split is necessary—it keeps the reasoning engine isolated while letting the retrieval engine run lean on the host.
The Workflow: Orchestrator + ClawMobile
The real value comes from the synergy between the Orchestrator and the ClawMobile app. It changes the dev process from “babysitting a terminal” to “asynchronous management.”
-
Orchestrator: Handles the heavy lifting—task queuing, multi-phase development, and background execution.
Task Management: A FastAPI backend with Celery and Redis handles asynchronous task queues, allowing for multi-phase development workflows (create, test, deploy).
Monitoring: Provides real-time WebSocket log streams and tool-tracking to audit every operation performed by the AI agents.
-
ClawMobile: Since it speaks the OpenClaw Gateway protocol, I can stay connected to the DGX Spark from anywhere. I can check the dashboard to see which tasks are in progress, failed, or completed in the Orchestrator background and provide real-time feedback to the agent while I’m away from my desk.
Secure Remote Access: Connects via Tailscale or LAN using Ed25519 authentication.
Mobile Supervision: Allows the user to track Orchestrator tasks and provide real-time feedback to OpenClaw, ensuring continuous improvement without being tethered to a laptop.
**
Resource Architecture Table**
| Layer | Components | Placement | Memory Impact |
|---|---|---|---|
| Cognition | Qwen 3.5 (35B), OpenClaw Gateway | Docker (AI-Gateway) | ~25GB |
| Retrieval | Ollama, Nomic-Embed, QMD v2 | Host Machine | Low (Optimized) |
| Management | FastAPI, Redis, Celery, SQLite | Docker (AI-Agent/Orchestrator) | ~2-10GB+ |
| Interface | Kotlin Android App / OpenClaw Dashboard or Orchestrator Frontend |
Mobile Device / Browser |
N/A / 500M-1GB |
**
Key Takeaways for Developers**
-
Don’t over-contextualize the LLM: Use a secondary, smaller embedding model for RAG/Retrieval to keep your main reasoning model fast.
-
Persistence: Use Docker Compose to mount host volumes for
/workspaceand/memoryso your agent’s “soul” survives a reboot. -
Infrastructure: Even with 120GB of VRAM, efficiency matters. Separating the Gateway from the Agent allows for better resource allocation.
This setup moves away from “AI as a chatbot” and toward “AI as an autonomous background process” that you manage via mobile.
Tips
In OpenClaw chatbox tag [gemini], [autoresearch], [think] keywords to get more functions.
Reference:





