Hi everyone. I've been building and running NeuralForge daily on my DGX Spark GB10 for the last 6 months and just open-sourced it. Wanted to share it with the Spark community since everything was built and tested on this hardware.
NeuralForge is a knowledge intelligence platform that ingests expert content at scale, builds a GPU-accelerated knowledge graph, and serves answers through any OpenAI-compatible tool. It runs entirely on the Spark with zero cloud dependencies.
What it does on the Spark:
Gemma 4 26B A4B running via NIM with TensorRT-LLM for chat and classification inference. 43 tok/s on GB10 with 17GB of GPU memory usage.
nomic-embed-text running via Triton Inference Server for batch embedding at 1000+ chunks per second with dynamic batching. Currently serving a 486K-chunk knowledge base from 80+ AI and ML experts.
RAPIDS cuGraph running GPU-accelerated knowledge graph operations. PageRank, community detection, shortest path, and 3-hop traversal across 500K nodes in 200ms. The graph persists as Parquet files and loads into GPU memory in under 2 seconds.
NeMo Guardrails in library mode for input and output safety rails, including hallucination detection against the knowledge graph and mandatory expert attribution (a library-mode sketch follows this list).
Total GPU memory footprint is about 28GB of the 128GB unified memory, leaving plenty of room for the knowledge graph and additional models.
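For anyone curious what library mode means in practice, here is a minimal sketch, not the actual NeuralForge wiring: the config directory is a placeholder, and the graph-backed hallucination check would live as a custom action registered inside that config.

```python
# NeMo Guardrails in library mode: wrap generation with input/output rails.
# "./guardrails_config" is a placeholder path; the hallucination check
# against the knowledge graph would be a custom action in that config.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

reply = rails.generate(messages=[
    {"role": "user", "content": "What changed in RAG best practice this year?"}
])
print(reply["content"])  # rails-approved answer with expert attribution
```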
Things I learned building on the Spark that might help others:
KV cache quantization with q4_0 is counterproductive on GB10. With the unified memory architecture, the dequantization workspace plus metadata overhead exceeds the savings from storing int4 instead of f16. q8_0 delivers a genuine ~2x compression benefit; q4_0 delivers neither memory savings nor speed. I published this finding earlier and it got some discussion here and on r/LocalLLaMA.
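For intuition, here is the paper math. The byte counts are ggml's block formats (q8_0 packs 32 values plus an fp16 scale into 34 bytes, q4_0 into 18 bytes); the model dimensions are hypothetical stand-ins, not the actual config.

```python
# Back-of-envelope KV cache sizing per quant format.
# ggml blocks: q8_0 = 34 bytes / 32 values (8.5 bits/value),
#              q4_0 = 18 bytes / 32 values (4.5 bits/value).
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 34 * 8 / 32, "q4_0": 18 * 8 / 32}

def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, fmt):
    """K+V cache size in GiB at ctx_len tokens (dimensions are hypothetical)."""
    values = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # 2 = K and V
    return values * BITS_PER_VALUE[fmt] / 8 / 2**30

for fmt in BITS_PER_VALUE:
    print(f"{fmt}: {kv_cache_gib(46, 8, 128, 32768, fmt):.2f} GiB")
# f16 ~5.75, q8_0 ~3.05, q4_0 ~1.62 GiB on paper; on GB10 the q4_0
# dequant workspace and metadata eat that last saving right back.
```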
Triton dynamic batching matters enormously for bulk ingestion. I ingested 49,000 blog articles today. Sequential embedding, one HTTP request at a time, managed about 20 chunks per second. Triton with dynamic batching pushes past 1000 chunks per second on the same hardware.
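The client side is mostly just keeping many requests in flight so the server can coalesce them (dynamic batching itself is enabled with a `dynamic_batching {}` stanza in the model's config.pbtxt). The model name, tensor names, and URL below are hypothetical:

```python
# Keep many requests in flight; Triton's dynamic batcher coalesces them
# server-side. Model/tensor names and the URL are deployment-specific.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000", concurrency=16)

def embed_async(batch_ids: np.ndarray):
    inp = httpclient.InferInput("input_ids", list(batch_ids.shape), "INT64")
    inp.set_data_from_numpy(batch_ids)
    out = httpclient.InferRequestedOutput("embedding")
    return client.async_infer("nomic-embed-text", inputs=[inp], outputs=[out])

# 256 small requests in flight at once instead of one at a time.
futures = [embed_async(np.zeros((1, 128), dtype=np.int64)) for _ in range(256)]
vectors = [f.get_result().as_numpy("embedding") for f in futures]
```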
cuGraph is dramatic overkill for small graphs but perfect once you scale. At 500K nodes and 5M edges the entire graph fits in about 200MB of GPU memory. Operations that took 45 seconds on SQLite take 200ms on cuGraph, and the Parquet save-and-reload cycle takes under 2 seconds at this scale.
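The reload-and-query path is short. The file path and column names here are assumptions, but the API calls are stock cuGraph:

```python
# Reload the persisted edge list from Parquet and run PageRank on GPU.
# The path and the "src"/"dst" column names are assumptions.
import cudf
import cugraph

edges = cudf.read_parquet("graph/edges.parquet")
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

pr = cugraph.pagerank(G)  # cudf DataFrame with vertex/pagerank columns
print(pr.sort_values("pagerank", ascending=False).head(10))
```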
The unified memory on GB10 changes how you think about model serving. I run 3 models simultaneously (26B chat, embedding, and a 31B vision model) totaling about 42GB. On a discrete-GPU system you would have to carefully partition VRAM; on the Spark the unified memory just handles it.
Features:
Temporal knowledge graph that tracks expert relationships with valid-from and valid-to dates. Knows who agrees, who contradicts, and what changed over time.
Layered context loading with 4 tiers scaled to the token budget: identity prompt, graph-enriched expert rankings, compressed chunks, and deep search.
Fact-preserving text compression at a 2 to 3x ratio for fitting more expert knowledge into smaller context windows.
17-tool MCP server for Claude Code, Cursor, and other MCP-compatible tools.
OpenAI-compatible proxy at /v1/chat/completions that auto-injects knowledge into any conversation. Point Open WebUI or any OpenAI SDK app at it and it gets smarter answers without knowing the knowledge system exists (sketch after this list).
Conversation mining from Claude and ChatGPT exports. Auto-capture from coding sessions.
Blog scraping with multi-strategy discovery. Document ingestion for PDF, DOCX, TXT, and HTML.
Auto-discovery worker that uses the LLM every 6 hours to classify expert pairs and build graph relationships automatically.
919 tests. Apache 2.0. Deploy in one command with docker compose up.
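To make the proxy feature concrete, this is all a client has to change. The port and model id are assumptions; use whatever your deployment binds:

```python
# Any OpenAI SDK app picks up the knowledge injection just by pointing
# base_url at the proxy. Port and model id here are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Where do the experts disagree on RAG evaluation?"}],
)
print(resp.choices[0].message.content)  # answer enriched from the graph
```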
Happy to answer questions about the architecture or share more details about running this stack on the Spark. If anyone is building something similar I would love to compare notes.