Hi everyone. I've been building and running NeuralForge daily on my DGX Spark GB10 for the last 6 months and just open-sourced it. Wanted to share it with the Spark community since everything was built and tested on this hardware.
NeuralForge is a knowledge intelligence platform that ingests expert content at scale, builds a GPU-accelerated knowledge graph, and serves answers through any OpenAI-compatible tool. It runs entirely on the Spark with zero cloud dependencies.
What it does on the Spark:
Gemma 4 26B A4B running via NIM with TensorRT-LLM for chat and classification inference. 43 tok/s on GB10 with 17GB of GPU memory usage.
nomic-embed-text running via Triton Inference Server for batch embedding at 1000+ chunks per second with dynamic batching. Currently serving a 486K-chunk knowledge base from 80+ AI and ML experts.
RAPIDS cuGraph running GPU-accelerated knowledge graph operations. PageRank, community detection, shortest path, and 3-hop traversal across 500K nodes in 200ms. The graph persists as Parquet files and loads into GPU memory in under 2 seconds.
NeMo Guardrails in library mode for input and output safety rails, including hallucination detection against the knowledge graph and mandatory expert attribution (a library-mode sketch follows this list).
Total GPU memory footprint is about 28GB of the 128GB unified memory, leaving plenty of room for the knowledge graph and additional models.
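For anyone curious what library mode means in practice, here is a minimal sketch, not the actual NeuralForge wiring: the config directory is a placeholder, and the graph-backed hallucination check would live as a custom action registered inside that config.

```python
# NeMo Guardrails in library mode: wrap generation with input/output rails.
# "./guardrails_config" is a placeholder path; the hallucination check
# against the knowledge graph would be a custom action in that config.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

reply = rails.generate(messages=[
    {"role": "user", "content": "What changed in RAG best practice this year?"}
])
print(reply["content"])  # rails-approved answer with expert attribution
```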
Things I learned building on the Spark that might help others:
KV cache quantization with q4_0 is counterproductive on GB10. With the unified memory architecture, the dequantization workspace plus metadata overhead exceeds the savings from storing int4 instead of f16. q8_0 delivers a genuine ~2x compression benefit; q4_0 delivers neither memory savings nor speed. I published this finding earlier and it got some discussion here and on r/LocalLLaMA.
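For intuition, here is the paper math. The byte counts are ggml's block formats (q8_0 packs 32 values plus an fp16 scale into 34 bytes, q4_0 into 18 bytes); the model dimensions are hypothetical stand-ins, not the actual config.

```python
# Back-of-envelope KV cache sizing per quant format.
# ggml blocks: q8_0 = 34 bytes / 32 values (8.5 bits/value),
#              q4_0 = 18 bytes / 32 values (4.5 bits/value).
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 34 * 8 / 32, "q4_0": 18 * 8 / 32}

def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, fmt):
    """K+V cache size in GiB at ctx_len tokens (dimensions are hypothetical)."""
    values = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # 2 = K and V
    return values * BITS_PER_VALUE[fmt] / 8 / 2**30

for fmt in BITS_PER_VALUE:
    print(f"{fmt}: {kv_cache_gib(46, 8, 128, 32768, fmt):.2f} GiB")
# f16 ~5.75, q8_0 ~3.05, q4_0 ~1.62 GiB on paper; on GB10 the q4_0
# dequant workspace and metadata eat that last saving right back.
```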
Triton dynamic batching matters enormously for bulk ingestion. I ingested 49,000 blog articles today. Sequential embedding, one HTTP request at a time, managed about 20 chunks per second. Triton with dynamic batching pushes past 1000 chunks per second on the same hardware.
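The client side is mostly just keeping many requests in flight so the server can coalesce them (dynamic batching itself is enabled with a `dynamic_batching {}` stanza in the model's config.pbtxt). The model name, tensor names, and URL below are hypothetical:

```python
# Keep many requests in flight; Triton's dynamic batcher coalesces them
# server-side. Model/tensor names and the URL are deployment-specific.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000", concurrency=16)

def embed_async(batch_ids: np.ndarray):
    inp = httpclient.InferInput("input_ids", list(batch_ids.shape), "INT64")
    inp.set_data_from_numpy(batch_ids)
    out = httpclient.InferRequestedOutput("embedding")
    return client.async_infer("nomic-embed-text", inputs=[inp], outputs=[out])

# 256 small requests in flight at once instead of one at a time.
futures = [embed_async(np.zeros((1, 128), dtype=np.int64)) for _ in range(256)]
vectors = [f.get_result().as_numpy("embedding") for f in futures]
```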
cuGraph is dramatic overkill for small graphs but perfect once you scale. At 500K nodes and 5M edges the entire graph fits in about 200MB of GPU memory. Operations that took 45 seconds on SQLite take 200ms on cuGraph, and the Parquet save-and-reload cycle takes under 2 seconds at this scale.
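The reload-and-query path is short. The file path and column names here are assumptions, but the API calls are stock cuGraph:

```python
# Reload the persisted edge list from Parquet and run PageRank on GPU.
# The path and the "src"/"dst" column names are assumptions.
import cudf
import cugraph

edges = cudf.read_parquet("graph/edges.parquet")
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

pr = cugraph.pagerank(G)  # cudf DataFrame with vertex/pagerank columns
print(pr.sort_values("pagerank", ascending=False).head(10))
```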
The unified memory on GB10 changes how you think about model serving. I run 3 models simultaneously (26B chat, embedding, and a 31B vision model) totaling about 42GB. On a discrete-GPU system you would have to carefully partition VRAM; on the Spark the unified memory just handles it.
Features:
Temporal knowledge graph that tracks expert relationships with valid-from and valid-to dates. Knows who agrees, who contradicts, and what changed over time.
Layered context loading with 4 tiers scaled to the token budget: identity prompt, graph-enriched expert rankings, compressed chunks, and deep search.
Fact-preserving text compression at a 2 to 3x ratio for fitting more expert knowledge into smaller context windows.
17-tool MCP server for Claude Code, Cursor, and other MCP-compatible tools.
OpenAI-compatible proxy at /v1/chat/completions that auto-injects knowledge into any conversation. Point Open WebUI or any OpenAI SDK app at it and it gets smarter answers without knowing the knowledge system exists (sketch after this list).
Conversation mining from Claude and ChatGPT exports. Auto-capture from coding sessions.
Blog scraping with multi-strategy discovery. Document ingestion for PDF, DOCX, TXT, and HTML.
Auto-discovery worker that uses the LLM every 6 hours to classify expert pairs and build graph relationships automatically.
919 tests. Apache 2.0. Deploy in one command with docker compose up.
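To make the proxy feature concrete, this is all a client has to change. The port and model id are assumptions; use whatever your deployment binds:

```python
# Any OpenAI SDK app picks up the knowledge injection just by pointing
# base_url at the proxy. Port and model id here are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Where do the experts disagree on RAG evaluation?"}],
)
print(resp.choices[0].message.content)  # answer enriched from the graph
```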
Happy to answer questions about the architecture or share more details about running this stack on the Spark. If anyone is building something similar I would love to compare notes.