Will the DGX Spark support serial stacking through ConnectX-7 beyond 2 units?
I understand the flagship Founders Edition is launching with a ConnectX-7 port in the back. Will we see unified stacked-cluster modularity via serial ConnectX-7 connections to expand beyond dual-cluster performance?
Also, I assume the LPDDR5x unified system memory is ECC? It isn’t mentioned explicitly in the specs, but since it’s LPDDR5x I assume it should be.
On that point, there is marketing content from NVIDIA and NVIDIA partners that implies the DGX Spark’s GB10 Superchip provides ECC support. If these units do not have it, that makes them a serious risk for model training.
Re: DGX Spark stacking – can we order our Sparks with unlocked firmware?
Since each Spark box has two ConnectX-7 ports in the back, with open-access firmware we could theoretically program the ConnectX-7s to accept serial connections beyond the guaranteed two-unit stack, at our own risk.
The lack of ECC memory in the DGX Spark is a surprising choice for a device intended for training, as it introduces significant risks.
The primary concern is silent data corruption. In our work, an uncorrected bit-flip can subtly flaw a model’s weights, causing a seemingly functional model to produce unreliable results that compromise downstream simulations in Omniverse.
A memory error could cause a perfectly valid fine-tuning approach to fail. A developer might then incorrectly discard a good idea, assuming their engineering was flawed rather than the hardware.
The DGX Spark should excel at rapid prototyping. However, once a final prototype is selected, its exact training process must be replicated on an ECC-protected system. This final “certification run” is a mandatory step to guarantee the model’s data integrity before deployment.
Thank you Maiia for bringing this to our attention!
It surprised me at first, but when I looked into why NVIDIA pushed most other boundaries with this unit yet didn’t add this feature, I realized the memory architecture and power/bandwidth budget limited the addition of proper ECC.
They would need to consider factoring ECC into the 2027 GB20 and a potential DGX Spark 2.0 built on a new architecture.
But for this model (rumored to start shipping late September), we just have to be aware of the limitations you’ve clearly articulated.
I’ll be taking mitigation steps like:
Segmenting model size, e.g. 70B vs. the full 405B in one go
Implementing refined training guardrails for complex models
Adding layered reviews to confirm validity and accuracy
Using an ECC-protected ecosystem as a validator or finisher
And, before anything else, being hyper-selective about the data used in training to minimize errors that would trigger massive reworking.
It’s worth noting that while the DGX Spark doesn’t come with full system-level ECC, LPDDR5X does provide some error correction in the form of on-die ECC, which corrects errors within the memory chip itself but doesn’t protect data once it leaves the die. This capability doesn’t remove the risk, but after further research I’m placing it at moderate-to-high rather than extreme.
Also, weighed against all the other tradeoffs, the lack of proper ECC isn’t a dealbreaker, at least for me.
The headline feature of the dual DGX Spark cluster is staggering: the ability to run inference on a 470B-parameter model on your desk. That makes it easy to assume it can also handle a full fine-tune of a much smaller 70B model. It can’t, because the memory requirement is at least ~840 GB (140 GB weights + 140 GB gradients + 560 GB optimizer states). So the real question is what will actually run on this new appliance (a rough calculator follows the list below):
Single DGX Spark (128 GB):
230B: Inference (4-bit)
210B: PEFT (LoRA, 4-bit base)
14B: Full Fine-Tune
Dual DGX Spark Cluster (256 GB):
470B: Inference (4-bit)
430B: PEFT (LoRA, 4-bit base)
26B: Full Fine-Tune
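For planning purposes, here is a minimal sketch of the arithmetic behind those ceilings. The byte-per-parameter constants are my assumptions (fp16/bf16 weights and gradients, fp32 Adam moments), chosen because they reproduce the ~840 GB figure for the 70B case; the PEFT and full fine-tune rows above likely assume lighter optimizer states, so treat every constant as tunable rather than as a spec.

```python
# Rough training/inference memory calculator (planning only, nothing measured).
# Assumed recipe: fp16/bf16 weights (2 B/param) + fp16/bf16 gradients (2 B/param)
# + fp32 Adam moments (8 B/param) = ~12 bytes per parameter, ignoring
# activations, KV cache, and OS/framework overhead.

def full_finetune_gb(params_b: float,
                     weight_bytes: float = 2.0,
                     grad_bytes: float = 2.0,
                     optimizer_bytes: float = 8.0) -> float:
    """Estimated memory in GB for a full fine-tune of a params_b-billion model."""
    return params_b * (weight_bytes + grad_bytes + optimizer_bytes)

def inference_4bit_gb(params_b: float) -> float:
    """4-bit quantized weights only: ~0.5 bytes per parameter."""
    return params_b * 0.5

print(f"70B full fine-tune (fp32 Adam):  ~{full_finetune_gb(70):.0f} GB")   # ~840 GB
print(f"26B full fine-tune (fp32 Adam):  ~{full_finetune_gb(26):.0f} GB")   # ~312 GB
print(f"26B full fine-tune (8-bit Adam): ~{full_finetune_gb(26, optimizer_bytes=2.0):.0f} GB")
print(f"470B 4-bit inference (weights):  ~{inference_4bit_gb(470):.0f} GB") # fits in 256 GB
```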
But what about the ConnectX-7 cable connecting the two Sparks? It’s easy to assume an NVIDIA-designed link would be seamless, but for the specific task of distributed training, this is where a bottleneck emerges. A single DGX Spark has a massive, 24-lane data interconnect between its CPU and GPU, running at an incredible ~600 GB/s. However, the ConnectX-7 link that connects two Sparks runs at only ~25 GB/s. This dramatic drop in bandwidth creates a perpetual data traffic jam when used for full fine-tuning, which requires constant, massive data synchronization. Time estimates show the impact: a 10B full fine-tune on a single Spark takes ~3.3 hours per epoch, while a 26B model on the dual cluster takes ~6.1 hours. For our usage, this time difference is an acceptable trade-off.
(Note: All calculations are theoretical estimates for planning purposes.)
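To put a rough number on that traffic jam, here is a back-of-the-envelope sketch of the gradient-synchronization cost per optimizer step over the ~25 GB/s link. It assumes plain data parallelism with a ring all-reduce over fp16 gradients and ignores compute/communication overlap and gradient bucketing, so it is a floor on the sync time rather than a reproduction of the epoch figures above.

```python
# Lower bound on per-step gradient sync time between the two Sparks.
# Assumptions (mine): pure data parallelism, ring all-reduce, fp16 gradients
# (2 B/param), ~25 GB/s effective ConnectX-7 bandwidth, no overlap with compute.

def allreduce_seconds_per_step(params_b: float,
                               nodes: int = 2,
                               grad_bytes: float = 2.0,
                               link_gb_s: float = 25.0) -> float:
    grad_gb = params_b * grad_bytes                  # gradient buffer size, GB
    traffic_gb = 2 * (nodes - 1) / nodes * grad_gb   # ring all-reduce volume per node
    return traffic_gb / link_gb_s

# A 26B-parameter model synchronized across the two Sparks every step:
print(f"~{allreduce_seconds_per_step(26):.1f} s/step of gradient traffic")  # ~2.1 s
```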
Your mitigated workflow looks like proper insurance, and I like your risk assessment of using the Spark.
The GB10 is incredible for sure, despite the ECC drawback.
Max, you raise several valid points, and I did not factor in the heavy resource lift required for fine-tuning.
Assuming the DGX OS and system services take up 19-32 GB of overhead per node depending on components, and the AI framework requirements add another 12-32 GB per node, the actual available memory reframes to a conservative 64-97 GB on a single node (call it ~80 GB usable) and 128-194 GB on the dual node, so we may be looking at ~160 GB with a light OS and services install.
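As a quick sanity check on those ranges, a tiny sketch (the overhead figures are the rough assumptions above, not measured DGX OS numbers):

```python
# Usable memory after OS/services and framework overhead (assumed ranges).
TOTAL_GB = 128                             # per-node unified memory
OS_GB = (19, 32)                           # DGX OS + system services (assumed)
FRAMEWORK_GB = (12, 32)                    # AI framework stack (assumed)

single = (TOTAL_GB - OS_GB[1] - FRAMEWORK_GB[1],   # heavy install: 64 GB
          TOTAL_GB - OS_GB[0] - FRAMEWORK_GB[0])   # light install: 97 GB
dual = (2 * single[0], 2 * single[1])              # 128-194 GB across two nodes

print(f"Single node usable: {single[0]}-{single[1]} GB (~80 GB as a working figure)")
print(f"Dual node usable:   {dual[0]}-{dual[1]} GB (~160 GB as a working figure)")
```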
I’m conservatively revising my expectations downward to:
an 8-10B full fine-tune on a single node
a 16-20B full fine-tune on a cluster
I’ll be aiming for a 10B model size as my ceiling, based on the relative (non-absolute) assumptions below for training and subsequent staging/inference (the arithmetic is sketched after the list):
Node 1 (Training Dedicated):
OS overhead: 30GB
Training framework: 20GB
Available for training: 78GB
Ceiling: 10B parameter full fine-tune
Node 2 (Staging/Inference):
OS overhead: 30GB
Inference framework: 10GB
Available for models: 88GB
Capacity: 175B parameter inference
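Here is that per-node budgeting as a small script. The overhead numbers are my rough assumptions from above, and the ~7.8 bytes/param used for the training ceiling is simply what a 10B cap inside 78 GB implies (roughly fp16 weights and gradients plus a reduced-precision optimizer), not a published figure.

```python
# Hypothetical per-node budgets for the training/inference split above.

def usable_gb(total_gb: float, os_gb: float, framework_gb: float) -> float:
    return total_gb - os_gb - framework_gb

def max_params_b(usable: float, bytes_per_param: float) -> float:
    return usable / bytes_per_param

# Node 1 (training): ~7.8 B/param is what a 10B ceiling in 78 GB implies.
train = usable_gb(128, os_gb=30, framework_gb=20)     # 78 GB
print(f"Node 1: {train:.0f} GB usable -> ~{max_params_b(train, 7.8):.0f}B full fine-tune ceiling")

# Node 2 (staging/inference): 4-bit weights at ~0.5 B/param (~176B, in line with the ~175B above).
infer = usable_gb(128, os_gb=30, framework_gb=10)     # 88 GB
print(f"Node 2: {infer:.0f} GB usable -> ~{max_params_b(infer, 0.5):.0f}B 4-bit inference capacity")
```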
I’m sure that as we start receiving the units, we’ll be sharing our experiences within the community and helping each other optimize our creations.
I think it should support serial stacking of more than 2 Sparks, since it has two 200G CX-7 ports, but how to assign the workload is a problem. NVIDIA may not provide official tools.
Besides, per your other topic, I agree that connecting more than 2 Sparks through a switch is another solution, and I really think that if third-party code handles task assignment well, there’s a chance to use a cheaper, lower-speed switch instead of an expensive NVIDIA 200G switch like the QM8700, because we may not really need 200G transfer speeds. Do you know of other switches with fewer 200G ports (or 100G ports, but cheaper and smaller)?
Dear @NVES, could you please confirm whether a setup illustration for configuring more than two DGX Sparks with a QSFP Ethernet switch will be provided sooner or later? I’d like to know whether it would be worth having 4 DGX Sparks rather than 2 for accelerated inference and/or more memory-size-bound use cases; specifically, enabling 681B models to be inferenced across 4 DGX Sparks. Thanks.
Connecting Spark CX7 to a QSFP Ethernet switch is the same process as with any other QSFP-capable device. No Spark-specific adjustments are needed since it’s standard Ethernet.