Will the DGX Spark support serial stacking through ConnectX-7 beyond 2 units?
I understand the flagship Founders Edition is launching with a ConnectX-7 port in the back. Will we see unified stacked-cluster modularity via serial ConnectX-7 connections to expand beyond dual-cluster performance?
Also, I assume the LPDDR5x unified system memory is ECC? It isn’t mentioned explicitly in the specs, but since it’s LPDDR5x I assume it should be.
On that point, there is marketing content from NVIDIA and NVIDIA partners that implies the DGX Spark’s GB10 Superchip provides ECC support. If these units do not have it, that makes them a serious risk for model training.
Re: DGX Spark stacking – can we order our Sparks with unlocked firmware?
Since each Spark box has two ConnectX-7 ports in the back, with open-access firmware we could theoretically program the ConnectX-7s to accept serial connections beyond the guaranteed two-unit stack, at our own risk.
The lack of ECC memory in the DGX Spark is a surprising choice for a device intended for training, as it introduces significant risks.
The primary concern is silent data corruption. In our work, an uncorrected bit-flip can subtly flaw a model’s weights, causing a seemingly functional model to produce unreliable results that compromise downstream simulations in Omniverse.
A memory error could cause a perfectly valid fine-tuning approach to fail. A developer might then incorrectly discard a good idea, assuming their engineering was flawed rather than the hardware.
The DGX Spark should excel at rapid prototyping. However, once a final prototype is selected, its exact training process must be replicated on an ECC-protected system. This final “certification run” is a mandatory step to guarantee the model’s data integrity before deployment.
Thank you Maiia for bringing this to our attention!
It surprised me at first, but when I looked into why NVIDIA pushed most other boundaries with this unit yet didn’t add this feature, I realized the memory architecture and power/bandwidth budget limited the addition of proper ECC.
They would need to consider factoring ECC into the 2027 GB20 and a potential DGX Spark 2.0 built on a new architecture.
But for this model (rumored to start shipping late September), we just have to be aware of the limitations you’ve clearly articulated.
I’ll be taking mitigation steps like:
Segmenting model size, e.g. 70B vs. the full 405B in one go
Implementing refined training guardrails for complex models
Adding layered reviews to confirm validity and accuracy
Using an ECC-protected ecosystem as a validator or finisher
And, before anything else, being hyper-selective about the data used in training to minimize errors that would trigger massive reworking.
It’s worth noting that while the DGX Spark doesn’t come with full system-level ECC, LPDDR5X does provide some error correction in the form of on-die ECC, which corrects errors within the memory chip itself but doesn’t protect data once it leaves the die. This capability doesn’t remove the risk, but after further research I’m placing it at moderate-to-high rather than extreme.
Also, weighed against all the other tradeoffs, the lack of proper ECC isn’t a dealbreaker, at least for me.
The headline feature of the dual DGX Spark cluster is staggering: the ability to run inference on a 470B-parameter model on your desk. That makes it easy to assume it can also handle a full fine-tune of a much smaller 70B model. It can’t, because the memory requirement is at least ~840 GB (140 GB weights + 140 GB gradients + 560 GB optimizer states). So the real question is what will actually run on this new appliance (a rough calculator follows the list below):
Single DGX Spark (128 GB):
230B: Inference (4-bit)
210B: PEFT (LoRA, 4-bit base)
14B: Full Fine-Tune
Dual DGX Spark Cluster (256 GB):
470B: Inference (4-bit)
430B: PEFT (LoRA, 4-bit base)
26B: Full Fine-Tune
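For planning purposes, here is a minimal sketch of the arithmetic behind those ceilings. The byte-per-parameter constants are my assumptions (fp16/bf16 weights and gradients, fp32 Adam moments), chosen because they reproduce the ~840 GB figure for the 70B case; the PEFT and full fine-tune rows above likely assume lighter optimizer states, so treat every constant as tunable rather than as a spec.

```python
# Rough training/inference memory calculator (planning only, nothing measured).
# Assumed recipe: fp16/bf16 weights (2 B/param) + fp16/bf16 gradients (2 B/param)
# + fp32 Adam moments (8 B/param) = ~12 bytes per parameter, ignoring
# activations, KV cache, and OS/framework overhead.

def full_finetune_gb(params_b: float,
                     weight_bytes: float = 2.0,
                     grad_bytes: float = 2.0,
                     optimizer_bytes: float = 8.0) -> float:
    """Estimated memory in GB for a full fine-tune of a params_b-billion model."""
    return params_b * (weight_bytes + grad_bytes + optimizer_bytes)

def inference_4bit_gb(params_b: float) -> float:
    """4-bit quantized weights only: ~0.5 bytes per parameter."""
    return params_b * 0.5

print(f"70B full fine-tune (fp32 Adam):  ~{full_finetune_gb(70):.0f} GB")   # ~840 GB
print(f"26B full fine-tune (fp32 Adam):  ~{full_finetune_gb(26):.0f} GB")   # ~312 GB
print(f"26B full fine-tune (8-bit Adam): ~{full_finetune_gb(26, optimizer_bytes=2.0):.0f} GB")
print(f"470B 4-bit inference (weights):  ~{inference_4bit_gb(470):.0f} GB") # fits in 256 GB
```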
But what about the ConnectX-7 cable connecting the two Sparks? It’s easy to assume an NVIDIA-designed link would be seamless, but for the specific task of distributed training, this is where a bottleneck emerges. A single DGX Spark has a massive, 24-lane data interconnect between its CPU and GPU, running at an incredible ~600 GB/s. However, the ConnectX-7 link that connects two Sparks runs at only ~25 GB/s. This dramatic drop in bandwidth creates a perpetual data traffic jam when used for full fine-tuning, which requires constant, massive data synchronization. Time estimates show the impact: a 10B full fine-tune on a single Spark takes ~3.3 hours per epoch, while a 26B model on the dual cluster takes ~6.1 hours. For our usage, this time difference is an acceptable trade-off.
(Note: All calculations are theoretical estimates for planning purposes.)
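To put a rough number on that traffic jam, here is a back-of-the-envelope sketch of the gradient-synchronization cost per optimizer step over the ~25 GB/s link. It assumes plain data parallelism with a ring all-reduce over fp16 gradients and ignores compute/communication overlap and gradient bucketing, so it is a floor on the sync time rather than a reproduction of the epoch figures above.

```python
# Lower bound on per-step gradient sync time between the two Sparks.
# Assumptions (mine): pure data parallelism, ring all-reduce, fp16 gradients
# (2 B/param), ~25 GB/s effective ConnectX-7 bandwidth, no overlap with compute.

def allreduce_seconds_per_step(params_b: float,
                               nodes: int = 2,
                               grad_bytes: float = 2.0,
                               link_gb_s: float = 25.0) -> float:
    grad_gb = params_b * grad_bytes                  # gradient buffer size, GB
    traffic_gb = 2 * (nodes - 1) / nodes * grad_gb   # ring all-reduce volume per node
    return traffic_gb / link_gb_s

# A 26B-parameter model synchronized across the two Sparks every step:
print(f"~{allreduce_seconds_per_step(26):.1f} s/step of gradient traffic")  # ~2.1 s
```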
Your mitigated workflow looks like proper insurance, and I like your risk assessment of using the Spark.
The GB10 is incredible for sure, despite the ECC drawback.
Max, you raise several valid points, and I did not factor in the heavy resource lift required for fine-tuning.
Assuming the DGX OS and system services take up 19-32 GB of overhead per node depending on components, and the AI framework requirements add another 12-32 GB per node, the actual available memory reframes to a conservative 64-97 GB on a single node (call it ~80 GB usable) and 128-194 GB on the dual node, so we may be looking at ~160 GB with a light OS and services install.
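As a quick sanity check on those ranges, a tiny sketch (the overhead figures are the rough assumptions above, not measured DGX OS numbers):

```python
# Usable memory after OS/services and framework overhead (assumed ranges).
TOTAL_GB = 128                             # per-node unified memory
OS_GB = (19, 32)                           # DGX OS + system services (assumed)
FRAMEWORK_GB = (12, 32)                    # AI framework stack (assumed)

single = (TOTAL_GB - OS_GB[1] - FRAMEWORK_GB[1],   # heavy install: 64 GB
          TOTAL_GB - OS_GB[0] - FRAMEWORK_GB[0])   # light install: 97 GB
dual = (2 * single[0], 2 * single[1])              # 128-194 GB across two nodes

print(f"Single node usable: {single[0]}-{single[1]} GB (~80 GB as a working figure)")
print(f"Dual node usable:   {dual[0]}-{dual[1]} GB (~160 GB as a working figure)")
```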
I’m conservatively revising my expectations downward to:
an 8-10B full fine-tune on a single node
a 16-20B full fine-tune on a cluster
I’ll be aiming for a 10B model size as my ceiling, based on the relative (non-absolute) assumptions below for training and subsequent staging/inference (the arithmetic is sketched after the list):
Node 1 (Training Dedicated):
OS overhead: 30GB
Training framework: 20GB
Available for training: 78GB
Ceiling: 10B parameter full fine-tune
Node 2 (Staging/Inference):
OS overhead: 30GB
Inference framework: 10GB
Available for models: 88GB
Capacity: 175B parameter inference
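Here is that per-node budgeting as a small script. The overhead numbers are my rough assumptions from above, and the ~7.8 bytes/param used for the training ceiling is simply what a 10B cap inside 78 GB implies (roughly fp16 weights and gradients plus a reduced-precision optimizer), not a published figure.

```python
# Hypothetical per-node budgets for the training/inference split above.

def usable_gb(total_gb: float, os_gb: float, framework_gb: float) -> float:
    return total_gb - os_gb - framework_gb

def max_params_b(usable: float, bytes_per_param: float) -> float:
    return usable / bytes_per_param

# Node 1 (training): ~7.8 B/param is what a 10B ceiling in 78 GB implies.
train = usable_gb(128, os_gb=30, framework_gb=20)     # 78 GB
print(f"Node 1: {train:.0f} GB usable -> ~{max_params_b(train, 7.8):.0f}B full fine-tune ceiling")

# Node 2 (staging/inference): 4-bit weights at ~0.5 B/param (~176B, in line with the ~175B above).
infer = usable_gb(128, os_gb=30, framework_gb=10)     # 88 GB
print(f"Node 2: {infer:.0f} GB usable -> ~{max_params_b(infer, 0.5):.0f}B 4-bit inference capacity")
```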
I’m sure that as we start receiving the units, we’ll be sharing our experiences within the community and helping each other optimize our creations.
I think it should support serial stacking of more than 2 Sparks, since it has two 200G CX-7 ports, but how to assign the workload is a problem. NVIDIA may not provide official tools.
Besides, per your other topic, I agree that connecting more than 2 Sparks through a switch is another solution, and I really think that if third-party code handles task assignment well, there’s a chance to use a cheaper, lower-speed switch instead of an expensive NVIDIA 200G switch like the QM8700, because we may not really need 200G transfer speeds. Do you know of other switches with fewer 200G ports (or 100G ports, but cheaper and smaller)?
Dear @NVES, could you please confirm whether a setup illustration for configuring more than two DGX Sparks with a QSFP Ethernet switch will be provided sooner or later? I’d like to know whether it would be worth having 4 DGX Sparks rather than 2 for accelerated inference and/or more memory-size-bound use cases; specifically, enabling 681B models to be inferenced across 4 DGX Sparks. Thanks.
Connecting Spark CX7 to a QSFP Ethernet switch is the same process as with any other QSFP-capable device. No Spark-specific adjustments are needed since it’s standard Ethernet.