NeuralForge: GPU-Native Knowledge Intelligence Platform Built on DGX Spark GB10

Hi everyone. I have been building and running NeuralForge on my DGX Spark GB10 daily for the last 6 months and just open-sourced it. Wanted to share it with the Spark community since everything was built and tested on this hardware.

NeuralForge is a knowledge intelligence platform that ingests expert content at scale, builds a GPU-accelerated knowledge graph, and serves answers through any OpenAI-compatible tool. It runs entirely on the Spark with zero cloud dependencies.

What it does on the Spark:

Gemma 4 26B A4B running via NIM with TensorRT-LLM for chat and classification inference. 43 tok/s on GB10 with 17GB GPU memory usage.

nomic-embed-text running via Triton Inference Server for batch embedding at 1000+ chunks per second with dynamic batching. Currently serving a 486K-chunk knowledge base from 80+ AI and ML experts.

RAPIDS cuGraph running GPU-accelerated knowledge graph operations: PageRank, community detection, shortest path, and 3-hop traversal across 500K nodes in 200ms. The graph persists as Parquet files and loads into GPU memory in under 2 seconds.
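The pattern here is edge-list in, ranked-nodes out. As a CPU stand-in (not the NeuralForge code), networkx exposes essentially the same algorithms cuGraph accelerates; cuGraph builds the equivalent graph from a cuDF edge-list DataFrame via from_cudf_edgelist and runs them on GPU. A toy sketch:

```python
import networkx as nx

# Toy edge list standing in for the knowledge graph; cuGraph would
# build the same structure from a cuDF edge-list DataFrame.
edges = [("expert_a", "topic_x"), ("expert_b", "topic_x"),
         ("topic_x", "topic_y"), ("expert_c", "topic_y")]
G = nx.DiGraph(edges)

# PageRank: rank nodes by connectivity (cugraph.pagerank on GPU).
ranks = nx.pagerank(G)

# 3-hop traversal from one node (a depth-limited BFS in cuGraph).
three_hop = nx.single_source_shortest_path_length(G, "expert_a", cutoff=3)
print(sorted(three_hop))  # ['expert_a', 'topic_x', 'topic_y']
```

At 500K nodes the same calls just take a bigger edge list; that is where the GPU version earns its keep.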

NeMo Guardrails in library mode for input and output safety rails, including hallucination detection against the knowledge graph and mandatory expert attribution.

Total GPU memory footprint is about 28GB of the 128GB unified memory, leaving plenty of room for the knowledge graph and additional models.

Things I learned building on the Spark that might help others:

KV cache quantization with q4_0 is counterproductive on GB10. With the unified memory architecture, the dequantization workspace plus metadata overhead exceeds the savings from storing int4 instead of f16. q8_0 provides a genuine 2x compression benefit; q4_0 provides neither memory savings nor a speed gain. I published this finding earlier and it got some discussion here and on r/LocalLLaMA.

Triton dynamic batching matters enormously for bulk ingestion. I ingested 49,000 blog articles today. Sequential embedding with a single HTTP server processed about 20 chunks per second. Triton with dynamic batching pushes past 1000 chunks per second on the same hardware.
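For reference, the batching behavior is driven by the model's config.pbtxt. A minimal sketch, assuming an ONNX-served embedding model; the name, backend, and sizes here are illustrative, not the actual deployment:

```protobuf
# config.pbtxt sketch; name, backend, and sizes are illustrative.
name: "nomic_embed_text"
backend: "onnxruntime"
max_batch_size: 256
dynamic_batching {
  preferred_batch_size: [ 64, 128, 256 ]
  max_queue_delay_microseconds: 500
}
```

The dynamic_batching stanza is what lets Triton coalesce many small HTTP requests into large GPU batches, which is the whole 20 vs 1000 chunks-per-second difference.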

cuGraph is dramatically overkill for small graphs but perfect once you scale. At 500K nodes and 5M edges the entire graph fits in about 200MB of GPU memory. Operations that took 45 seconds on SQLite take 200ms on cuGraph. The Parquet save and reload cycle takes under 2 seconds at this scale.

The unified memory on GB10 changes how you think about model serving. I run 3 models simultaneously (a 26B chat model, an embedding model, and a 31B vision model) totaling about 42GB. On a discrete GPU system you would need to carefully partition VRAM. On Spark the unified memory just handles it.

Features:

Temporal knowledge graph that tracks expert relationships with valid from and valid to dates. Knows who agrees, who contradicts, and what changed over time.

Layered context loading with 4 tiers scaled to the token budget: identity prompt, graph-enriched expert rankings, compressed chunks, and deep search.

Fact-preserving text compression at a 2 to 3x ratio for fitting more expert knowledge into smaller context windows.

A 17-tool MCP server for Claude Code, Cursor, and other MCP-compatible tools.

OpenAI-compatible proxy at /v1/chat/completions that auto-injects knowledge into any conversation. Point Open WebUI or any OpenAI SDK app at it and they get smarter answers without knowing the knowledge system exists.

Conversation mining from Claude and ChatGPT exports. Auto-capture from coding sessions.

Blog scraping with multi-strategy discovery. Document ingestion for PDF, DOCX, TXT, and HTML.

Auto discovery worker that uses the LLM every 6 hours to classify expert pairs and build graph relationships automatically.

919 tests. Apache 2.0. Deploy in one command with docker compose up.
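The layered context loading mentioned above can be sketched as a greedy budget fill. Tier names are from this post; the token costs and the select_tiers helper are made-up illustrative pieces, not the actual implementation:

```python
# Hypothetical sketch of tiered context loading: spend a token budget
# on tiers in priority order, keeping each tier only if it still fits.
TIERS = [  # (name, estimated token cost) -- illustrative numbers
    ("identity_prompt", 500),
    ("expert_rankings", 1500),
    ("compressed_chunks", 4000),
    ("deep_search", 8000),
]

def select_tiers(budget):
    chosen, spent = [], 0
    for name, cost in TIERS:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

print(select_tiers(2500))   # ['identity_prompt', 'expert_rankings']
print(select_tiers(16000))  # all four tiers
```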

GitHub: https://github.com/NathanMaine/neuralforge

Happy to answer questions about the architecture or share more details about running this stack on the Spark. If anyone is building something similar I would love to compare notes.

I will need to review the project, but aspects sound quite similar to something I’ve been working on. cuGraph is especially interesting; I’ve so far been building in Postgres with Apache AGE and pgvector. Also curious about your temporal take, as it seems like Graphiti’s. I have the same implementation for the proactive proxy injection, except mine points at the Opencode API currently.

We should likely go to PMs.

Hey @jrsphd, appreciate the response and sorry for the slow turnaround. This one slipped past me for a week.

The overlap does sound real. On the temporal side, NeuralForge tracks expert relationships with valid from and valid to on each edge. It’s simpler than Graphiti’s full bitemporal approach (no transaction time), but it handles the “who agreed and when did they change their mind” problem, which was my main use case. Curious if the simpler model held up for you or if you went full bitemporal.
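For anyone following along, the valid-from/valid-to model is easy to sketch in plain Python. Field names here are hypothetical, not the actual schema; a None valid_to means the relationship is still current:

```python
from datetime import date

# Hypothetical temporal edges: each relationship carries a validity window.
edges = [
    {"src": "expert_a", "dst": "expert_b", "rel": "agrees_with",
     "valid_from": date(2022, 1, 1), "valid_to": date(2023, 6, 1)},
    {"src": "expert_a", "dst": "expert_b", "rel": "contradicts",
     "valid_from": date(2023, 6, 1), "valid_to": None},  # still open
]

def edges_as_of(edges, when):
    """Edges whose half-open validity window [valid_from, valid_to) contains `when`."""
    return [e for e in edges
            if e["valid_from"] <= when
            and (e["valid_to"] is None or when < e["valid_to"])]

print([e["rel"] for e in edges_as_of(edges, date(2023, 1, 1))])  # ['agrees_with']
print([e["rel"] for e in edges_as_of(edges, date(2024, 1, 1))])  # ['contradicts']
```

Querying the graph "as of" a date is then just this filter applied before traversal, which is what answers "who agreed and when did they change their mind" without transaction time.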

On Apache AGE + pgvector, that’s a stack I looked at seriously before picking cuGraph and Qdrant. Would love to hear how query performance has held up as the graph grows and how you’re handling the hybrid retrieval layer.

Sending a PM now with some more specific questions on the proactive proxy injection side. Good to hear someone else converged on the same pattern there.

Please don’t hesitate to share more about your discussions. I am super interested in this kind of project (I have it in my personal backlog). And as my experience and knowledge in the area are still quite limited, the thoughts, engineering process, and discussion are very valuable for me as a way to learn 🧑‍🎓😇

Thanks for the interest @_piotr3k. The NeuralForge repo is public if you want to browse the code: https://github.com/NathanMaine/neuralforge. It is a GPU-native knowledge intelligence platform built on six NVIDIA technologies (NIM, TensorRT-LLM, Triton, NeMo Guardrails, RAPIDS cuGraph, CUDA) with a single-command deploy. Current focus is bulk ingestion pipelines (YouTube, blogs, PDFs) and cuGraph-based graph traversal. Happy to answer specific questions if any particular angle is useful for your backlog.