Deterministic Pre-Collapse Signal for NCCL Silent Stall Detection (~t-30 lead time)

Hi NVIDIA HPC team,

I’m sharing an observation from distributed training logs that may be relevant to NCCL-based multi-GPU systems.

Problem

In large-scale distributed training, we occasionally observe silent stalls where:

  • GPU utilization remains at ~100%

  • No immediate hardware failure is reported

  • Training progress stops due to NCCL synchronization issues or straggler effects

These cases are typically only detected after timeout or manual inspection.


Observation

From analysis of historical failure logs (GTX 1070-based experiments), we observed a consistent pre-collapse pattern:

  • A structural deviation signal begins increasing approximately 30 steps before full stall

  • Standard telemetry (GPU util, memory usage, temperature) remains stable during this period

  • The failure manifests later as a full synchronization breakdown

This suggests that the failure is not abrupt, but preceded by a measurable state-space deformation.


Approach (Conceptual)

We compute a deterministic deviation score by comparing runtime telemetry against a baseline distribution of normal execution behavior.

Key properties:

  • No machine learning model

  • No training phase

  • Purely deterministic transformation of telemetry streams

  • Designed as a passive diagnostic layer (no runtime intervention)
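
The post does not spell out the transform, so the following is only one possible reading of "deterministic deviation score against a baseline": a per-channel z-score against the baseline mean and standard deviation, combined as a root mean square. The function names, the two telemetry channels, and the sample values are all hypothetical and are not the author's method.

```python
import math
from statistics import fmean, pstdev

def fit_baseline(samples):
    """Per-channel (mean, std) computed from rows of baseline telemetry."""
    columns = list(zip(*samples))
    return [(fmean(col), pstdev(col)) for col in columns]

def deviation_score(row, baseline, eps=1e-9):
    """Root-mean-square z-score of one telemetry row against the baseline."""
    z_squared = [((x - mu) / (sigma + eps)) ** 2
                 for x, (mu, sigma) in zip(row, baseline)]
    return math.sqrt(fmean(z_squared))

# Hypothetical channels: (step_time_ms, sync_wait_ms)
baseline_rows = [(10.0, 1.0), (10.2, 1.1), (9.8, 0.9), (10.1, 1.0), (9.9, 1.0)]
baseline = fit_baseline(baseline_rows)

normal = deviation_score((10.0, 1.0), baseline)   # sits on the baseline mean
drifted = deviation_score((10.0, 2.0), baseline)  # sync_wait has doubled
print(f"normal={normal:.2f} drifted={drifted:.2f}")
```

There is no training phase beyond computing two summary statistics per channel, and the same input always yields the same score, which fits the "passive, deterministic" properties listed above.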


Question

Has anyone observed similar pre-collapse structural signals in NCCL-based distributed training systems?

Specifically:

  • Are silent stalls known to have measurable precursors at the runtime layer (not just hardware metrics)?

  • Is there any existing work in NCCL / CUDA runtime that models failure as gradual state drift rather than discrete events?

Any feedback on whether this kind of signal aligns with known failure modes in large-scale GPU clusters would be appreciated.


Thanks in advance for any insights.

Yes — we’ve hit exactly this on a 2-node GB10 (DGX Spark) FSDP setup. Util pegged at 100%, no hardware fault logged, training just quietly stops, and the only thing that eventually surfaces it is the NCCL watchdog firing on a collective timeout well after the actual stall. The “only detected after timeout” part matches us completely.

We never got as far as a real pre-collapse signal like yours — our mitigation was reactive (bumping TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC so the watchdog stopped false-firing under load). But one thing we noticed, at least on unified-memory hardware: our stalls seemed to cluster late in long training sessions, after the allocator had been fragmenting for a while. We never proved causation, but the timing lined up — and standard telemetry (util, temp, mem usage) looked fine right up until the collective hung, same as you describe. The drift, if it was there, was in allocator state, not anything NCCL was reporting.

Caveat that your GTX 1070 runs have discrete VRAM and ours is GB10 unified memory, so that precursor may not transfer. But if you’ve got allocator/UMA stats in your telemetry streams, might be worth checking whether they deform on the same ~30-step lead as your signal.
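
If it helps, here is a rough sketch of one allocator quantity that could be logged next to step time. It assumes a stats mapping shaped like the dict returned by `torch.cuda.memory_stats()` and reads only the `reserved_bytes.all.current` and `allocated_bytes.all.current` entries. The ratio is a crude proxy (reserved memory also includes cached free blocks, not just fragmentation), and whether it reflects UMA behavior on GB10 is an open assumption. The snapshot values are made up.

```python
from typing import Mapping

def fragmentation_ratio(stats: Mapping[str, int]) -> float:
    """Fraction of reserved allocator memory not backing live tensors.

    `stats` is assumed to look like torch.cuda.memory_stats() output;
    only two of its keys are read.
    """
    reserved = stats.get("reserved_bytes.all.current", 0)
    allocated = stats.get("allocated_bytes.all.current", 0)
    if reserved <= 0:
        return 0.0
    return 1.0 - allocated / reserved

# Hypothetical snapshots; in a real run: stats = torch.cuda.memory_stats()
early = {"reserved_bytes.all.current": 8 << 30,
         "allocated_bytes.all.current": 7 << 30}
late = {"reserved_bytes.all.current": 8 << 30,
        "allocated_bytes.all.current": 5 << 30}
print(f"early={fragmentation_ratio(early):.3f} late={fragmentation_ratio(late):.3f}")
```

Logged once per step, a series like this could be lined up against the ~30-step lead to see whether it moves before the collective hangs.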

Draft:

We observed a consistent non-linear structure in collapse probability across BS × HD parameter space.

In particular, we see:

  • a stable plateau region (high collapse probability),

  • a sharp transition band,

  • and a high-variance drop-off region beyond a certain HD threshold.

These patterns were stable across 20 runs per point, suggesting this is not purely stochastic noise at the measurement level.

However, at this stage we treat this as an observational regime structure, not a fully validated phase boundary model.

A key open question is whether this regime separation persists under different distributed settings (e.g. FSDP / multi-GPU GB10 configurations), where allocator pressure and synchronization effects may interact differently.

We would be very interested to compare whether similar non-monotonic collapse regions appear in your production-scale traces.

@yonezaemon1 Honest answer: we don’t have a comparable dataset. We were reactive, bounding training sessions to dodge the stall, not running a controlled collapse-probability sweep across BS × HD. So I can’t tell you whether your plateau / transition / drop-off regimes show up in our traces. We didn’t instrument for it.

What I can offer is a confound worth weighing before assuming the regime structure transfers. Our GB10 is unified memory: 128GB shared between CPU and GPU. Allocator pressure and fragmentation play out across one shared pool, which your discrete-VRAM GTX 1070 runs don’t have. If any part of your collapse-probability structure is mediated by allocator state rather than pure compute and sync dynamics, it may not map cleanly onto UMA. The transition band could shift or smear rather than reproduce. That’s a hypothesis, not something we’ve measured against your data.

If you do get to test on FSDP / multi-GPU GB10, the high-variance drop-off region past your HD threshold is where I’d watch first. That’s exactly where I’d expect allocator effects to interact with the synchronization picture you’re describing.

Thanks — this clarifies the decomposition point well.

At this stage I think the right next step is to move from model alignment to controlled instrumentation. We’ll define a minimal measurement set for:

  • allocator pressure

  • synchronization drift

  • step-time variance

and run a bounded sweep across both discrete VRAM and UMA setups.

I’ll come back once we have initial comparative structure from the data.

Following up with measurement results from GTX 1070 (discrete VRAM, 8GB VRAM).

Environment:

  • Device: NVIDIA GeForce GTX 1070
  • PyTorch: 2.7.1+cu118
  • Matrix size: 2048x2048, N=50 steps

Results:

  • Step Time: mean 12.95ms, stdev 48.57ms
  • Sync Drift: mean 5.394ms, stdev 0.333ms
  • VRAM Usage: mean 56.1MB, stdev 0.00MB
  • Permutation Instability (PI): 0.960

Key observation:
Sync Drift variance is low (stdev 0.333ms), meaning synchronization itself is stable. Yet Permutation Instability = 0.960, meaning the ordering structure of step times is nearly fully collapsed.

This is the critical finding: the instability is not in synchronization latency itself, but in the ordering structure of execution timing.

Your hypothesis about allocator state as a confounding factor is well-taken. In this discrete VRAM configuration, the allocator pool is separate from CPU memory — yet PI≈0.96 still appears. This suggests the ordering collapse may be architecture-independent, occurring even without UMA allocator interactions.

Proposed next comparison:
If you can run the same measurement on GB10 (UMA), comparing PI values between UMA and discrete VRAM would help isolate whether the collapse structure is driven by allocator dynamics or by something more fundamental in the synchronization process itself.

Happy to share the measurement script if useful.

Additional measurements from GTX 1070 (discrete VRAM) following up on our earlier exchange.


1. PI does not scale with workload size

Tested matrix sizes 512 / 1024 / 2048 / 4096, N=50 steps each. PI remained consistently in the 0.96–1.00 range across all sizes.

| Size | Mean (ms) | Stdev (ms) | PI |
| --- | --- | --- | --- |
| 512 | 1.17 | 5.185 | 1.000 |
| 1024 | 1.92 | 0.054 | 0.980 |
| 2048 | 5.42 | 0.283 | 1.000 |
| 4096 | 31.99 | 0.382 | 0.980 |

The ordering instability appears load-independent in this configuration.


2. PI shows repeated collapse and partial recovery over time

Over 200 steps (block size 10), PI did not drift monotonically. Instead, it oscillated between 0.70 and 1.00, with alternating periods of higher and lower instability.

This behavior does not fit a simple drift model.


3. No correlation with VRAM usage

VRAM remained flat at ~56.1MB throughout all runs. No variation that could account for PI fluctuation. Allocator pressure does not appear to be a factor in this discrete VRAM setup.


4. Sync Drift does not precede PI changes

Sync Drift showed a weak positive correlation with PI (high-PI blocks: 4.310ms avg vs low-PI blocks: 4.148ms avg, delta +0.162ms), but block-ahead prediction accuracy was 52.6% — essentially chance.

Sync Drift and PI appear to be co-varying rather than causally linked in this direction.
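
A quick check on why 52.6% reads as chance. The trial count is not stated above, so this is an assumption: if the 200-step, block-size-10 run from item 2 was used, there are 20 blocks and 19 block-ahead predictions, and 10 correct out of 19 rounds to 52.6%. Under that assumption, an exact binomial tail gives the probability of doing at least this well by coin flip.

```python
from math import comb

def binom_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

trials = 19    # assumption: 20 blocks of 10 steps give 19 block-ahead predictions
correct = 10   # assumption: 10/19 rounds to the reported 52.6%

accuracy = correct / trials
p_value = binom_tail(correct, trials)
print(f"accuracy={accuracy:.1%} P(X >= {correct})={p_value:.2f}")
```

With these assumed counts the tail probability is one half, so the prediction result carries no evidence of a lead-lag relationship.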


Working hypothesis (preliminary)

Given that VRAM, allocator pressure, and Sync Drift all fail to explain PI variation, we are tentatively looking at execution scheduling structure as a candidate — something above the memory hierarchy layer.


Proposed comparison

If GB10 (UMA) shows VRAM variation while PI behaves similarly, that would help isolate whether this is memory-hierarchy-dependent or something in the execution ordering layer itself.

Would it be feasible to run a comparable sweep on GB10 when you have the opportunity?

Appreciate you posting the GTX 1070 baseline — PI staying flat across 512/1024/2048/4096 is the kind of clean cross-size signal we couldn’t isolate on our end. Curious whether you’re planning a same-test sweep on a newer-arch single-GPU before going to bounded multi-node, or jumping straight to the multi-node measurement next.

Thank you for your continued engagement.

Following up on the PI measurements, I attempted to introduce local trajectory volume to explore whether PI collapse could be reframed geometrically.

The approach:

  • Embedded step_time series into 3D space using delay embedding (TAU=1)

  • Constructed tetrahedra from 4 consecutive points

  • Computed local volume V(t) = det([P1-P0, P2-P0, P3-P0]) / 6
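
For reference, a minimal pure-Python sketch of the three steps as described in the list above (3D delay embedding with TAU=1, tetrahedra from 4 consecutive embedded points, determinant over 6). The function names are mine, and the volume is left signed; take `abs()` if only the magnitude matters.

```python
def delay_embed(series, dim=3, tau=1):
    """Map a 1-D series to points (x[t], x[t+tau], ..., x[t+(dim-1)*tau])."""
    span = (dim - 1) * tau
    return [tuple(series[t + k * tau] for k in range(dim))
            for t in range(len(series) - span)]

def det3(a, b, c):
    """Determinant of the 3x3 matrix with rows a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

def local_volumes(series, tau=1):
    """Signed tetrahedron volume for each run of 4 consecutive embedded points."""
    pts = delay_embed(series, dim=3, tau=tau)
    vols = []
    for i in range(len(pts) - 3):
        p0, p1, p2, p3 = pts[i:i + 4]
        edges = [tuple(q[j] - p0[j] for j in range(3)) for q in (p1, p2, p3)]
        vols.append(det3(*edges) / 6.0)
    return vols

flat = local_volumes([3.74] * 10)          # constant step_time
pulse = local_volumes([0, 0, 0, 1, 0, 0, 0])  # single one-step excursion
```

A constant or perfectly linear series gives zero volume everywhere, so any nonzero V(t) under this construction comes from step-to-step variation in the single embedded series.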

Conditions: GTX1070 / PyTorch 2.7.1+cu118 / 2048x2048 / N=200

Results:

  • Baseline region: mean(V) = 0.0013

  • Around step 125-129: local volume exhibited transient spikes reaching approximately 10x the baseline magnitude

  • This pattern reproduced across 2 trials

Critically, step_time values during that region remained within normal range (~3.7ms).

This suggests the anomaly may reside not in latency magnitude, but in execution trajectory geometry — consistent with what was observed in PI, where order structure collapsed while individual values appeared normal.

The cause remains unresolved. However, the observation that local geometric structure changes while monitored values stay normal may indicate something beyond UMA allocator interaction alone.

Whether this is GTX1070-specific or architecture-independent is still undetermined. A direct comparison with GB10 data would help clarify this significantly.

Follow-up — and an important correction.

After posting the local-trajectory-volume V(t) result, I ran controlled experiments to test whether the signal was real or an artifact. I want to correct the record before this goes any further, because the strong claim — that PI / V(t) is a direct detector of silent NCCL stalls — does not survive the controls.

What I found:

  1. The raw step_time around the “spike” region (steps 120–132) is completely flat — all ~3.74ms, no anomaly. The spike only appears after the volume transform.

  2. My V(t) computation was not the delay-embedding I described earlier. The code built the tetrahedron from three unrelated quantities (step_time, sync_drift, cpu_launch) as x/y/z axes. Holding cpu_launch constant drops the V(t) spike to exactly zero — so it was driven entirely by CPU-side allocation jitter from torch.randn, not GPU execution. Apologies for the mismatch between the method I described and the actual code.

  3. Across 5 seeds, the V(t) peak location jumps around (steps 168, 48, 196, 0, 1; std ≈ 83). It is not fixed at step 125 — the “reproduced twice” was coincidence from the noise tail.
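
The spread quoted in item 3 can be reproduced from the five peak locations; the ≈ 83 figure matches the population standard deviation (the sample version would be somewhat larger).

```python
from statistics import fmean, pstdev

peak_steps = [168, 48, 196, 0, 1]  # V(t) peak location per seed, from item 3

mean_peak = fmean(peak_steps)
spread = pstdev(peak_steps)        # population standard deviation
print(f"mean={mean_peak:.1f} std={spread:.1f}")
```

A spread about as large as the mean, over a 200-step run, is what a peak with no preferred location looks like.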

I re-tested PI itself with the same rigor:

  • Normal PI = 0.915
  • PI after fully shuffling step_time = 0.925
  • PI after replacing step_time with pure Gaussian noise = 0.910

Statistically identical. Shuffling the order doesn’t change PI; replacing the data with pure noise doesn’t change PI. So PI is not capturing execution-order structure — it’s the rank noise you get from ranking ~10 nearly-identical values. The “cross-size constant PI” is, unfortunately, just that artifact being size-independent.
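
The same three-way control can be wrapped around any candidate observable. The sketch below is a generic harness, not the script used for the numbers above. PI's definition is not given in this thread, so the example statistic is a stand-in (mean absolute step-to-step change) chosen only because it is clearly order-sensitive.

```python
import random
from statistics import fmean, pstdev

def mean_abs_step(series):
    """Order-sensitive stand-in statistic: mean |x[t+1] - x[t]|."""
    return fmean([abs(b - a) for a, b in zip(series, series[1:])])

def run_controls(series, statistic, seed=0):
    """Evaluate `statistic` on the real series, a shuffled copy,
    and Gaussian noise matched to the series mean and std."""
    rng = random.Random(seed)
    shuffled = list(series)
    rng.shuffle(shuffled)
    mu, sigma = fmean(series), pstdev(series)
    noise = [rng.gauss(mu, sigma) for _ in series]
    return {"real": statistic(series),
            "shuffled": statistic(shuffled),
            "noise": statistic(noise)}

# A ramp has strong order structure, so the controls should differ sharply.
ramp = [float(i) for i in range(200)]
result = run_controls(ramp, mean_abs_step)
print(result)
```

An observable that really captures ordering should separate "real" from the two controls, as the ramp does here. Three near-identical values, as with PI above, mean the statistic is blind to order.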

So: under these conditions (single GTX 1070, matmul, synchronize present), PI and 4-point volume were very likely artifacts, and I don’t claim them as direct stall detectors.

That said, I’m not abandoning the underlying question. Silent NCCL stalls are a real problem, and the thing I was actually reaching for — local structural change in the synchronization path while latency still looks normal — is still worth instrumenting. What today’s controls ruled out is this particular observable, not the question itself. My next step is to look for a more direct quantity: per-kernel timing via CUDA events rather than wall-clock step_time, and ideally measured on the collective path itself rather than a matmul proxy.

Thank you for engaging seriously with the data — that’s what pushed me to falsify it properly. If you’ve seen silent stalls through other means, I’d genuinely like to hear what the underlying signal looked like. That’s the real target.

Respect for posting the falsification — that’s the right move and it’s harder to do publicly than to just quietly drop the line of inquiry.

On the underlying question: we don’t have a pre-collapse detector either, just the reactive end. What we observed on multi-node GB10 was the symptom shape you described earlier — util pegged at ~100%, no hardware fault logged, training quietly stops, NCCL watchdog timeout fires well after the actual stall. What worked operationally was bounding session length (cap epochs, save checkpoint frequently, restart) to dodge the stall rather than detect it. Not satisfying, but it kept us moving.

We have NOT instrumented the collective path with CUDA events or anything that distinguishes a stall-precursor from normal noise — so I can’t contribute to detector design from existing data. But the reframing direction sounds right to me: per-kernel timing on the collective itself rather than a matmul proxy, and the controls discipline you just applied is what’ll separate signal from artifact when one shows up.

Thank you — and the symptom description is exactly what I needed: utilization pinned near 100%, clean logs, training quietly stops, watchdog fires long after the real stall. That’s a “looks normal, but isn’t progressing” signature, which lines up with what my controls kept pointing at: the mean is uninformative, the structure (if any) is elsewhere.

Since the correction I kept running single-GPU controls to map out what is NOT the signal, so I don’t chase it again:

  • CUDA-event per-kernel timing instead of wall-clock: same result — normal PI ≈ shuffled ≈ pure-noise. Precision wasn’t the issue.
  • Sequential-dependent chains (output feeds next input): still no order structure. lag-1 autocorrelation ≈ 0.
  • Compute-bound (matmul) vs bandwidth-bound (large copy, elementwise): both time-independent. A one-off autocorr of -0.30 did not survive 5 repeats — it was a statistical fluke, plus one run with a clear external-perturbation signature (mean 3×, autocorr ≈ 0.99).

So single-GPU execution time appears time-independent across every observable I tried. That’s a clean negative result, and it narrows things: no order structure showed up once in isolation, so a stall precursor, if one exists, is probably not an individual-GPU property and more likely lives in the relationship between execution units.

My plan favors speed: rather than wait for a multi-node setup, I want to push how far single-GPU can go first. Two CUDA streams contending on one GPU is the smallest “two units meeting at a sync point” I can build, a miniature of the collective wait. The question is whether a “busy but not progressing” state (your 100%-util symptom, in miniature) shows up as a phase offset between the two streams while mean latency still looks fine. Same control discipline throughout: shuffle, seed sweep, and repeat-for-stability before I believe anything.

If single-GPU contention reproduces even a faint version of the signature, that’s the cheap testbed before touching the collective path. I’ll report back either way — including if it’s another dead end.

Additional update — follow-up experiments after my last post.

After posting the V(t) correction, I ran two more controlled experiments today to clarify the scope of what I’m actually seeing.

**Experiment 1: Symmetric contention on GTX 1070 (N=5000, 237 windows)**

Real signal: median diff = +0.085, positive windows 71% (169/237), sign test z = +6.56

Shuffle control: median diff = -0.008, z = -1.14

The phase-locking effect is statistically real (z >> 2), but the effect size is small (+0.085). This is not a strong lock — more like a weak memory: when two symmetric streams compete, the result of one step slightly biases the next. Detectable, but subtle.
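
The z value can be reproduced from the window counts alone. This sketch uses the normal approximation to the sign test without a continuity correction, and assumes windows with zero difference were either absent or already excluded from the 237.

```python
import math

def sign_test_z(positive: int, total: int) -> float:
    """Normal-approximation z for a sign test under H0: P(positive) = 0.5."""
    return (positive - total / 2) / (math.sqrt(total) / 2)

z_real = sign_test_z(169, 237)
print(f"fraction positive = {169 / 237:.3f}, z = {z_real:+.2f}")
```

The z score measures how unlikely the 169/237 split is under a fair coin, not how large the per-window difference is, which is why a small median diff can still sit far from zero in z.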

Your earlier question about architecture scope pushed me to re-examine the design. I now think the signal only appears under symmetric contention, not isolated execution timing — which is why the single-stream baseline shows nothing.

**Experiment 2: CPU memory bandwidth contention (same design)**

Symmetric: z = -2.24 (no locking)

Asymmetric 4:1: z = -3.13 (no locking)

The effect does not appear in CPU memory bandwidth competition. This rules out “generic resource contention” as the explanation. Whatever is producing the phase-locking in the GPU case is specific to GPU execution mechanics — scheduler behavior, arbitration, or something in that layer.

**What this means for your question**

You asked whether I planned to move to multi-node before testing newer architectures. Honestly: with a single GTX 1070 and matmul as a proxy, I’ve probably reached the ceiling of what I can resolve. The effect is real but small (+0.085), reproducible across rounds, and GPU-specific rather than a general contention phenomenon.

This actually supports your framing. If a signal this weak is already GPU-specific and invisible in CPU bandwidth, then the meaningful dynamics are almost certainly in the collective path — synchronization barriers, ring dependencies, NCCL internals — exactly where you pointed from the beginning.

I don’t have the hardware to go there. But I wanted to close the loop with clean data before the thread goes quiet. If a GB10 comparison ever becomes possible, I’d be curious whether the phase-locking shows up stronger in UMA architecture, or disappears entirely.

Thank you again for engaging seriously with this.

Clean closure. Sign test against shuffle, CPU bandwidth ruling out generic contention, negative on single-stream isolation: that all tightens what the effect actually is. The +0.085 holding up against the shuffle control and disappearing under isolated execution is real evidence even if it doesn’t get you to a detector.

On the GB10/UMA question: right thing to ask. The architectural difference isn’t trivial. UMA collapses the CPU/GPU memory hierarchy into one shared pool, so symmetric contention could either amplify (allocator and scheduler now arbitrate across both compute domains on shared backing storage) or dissolve (the contention path looks different from the discrete-VRAM case where GPU memory owns arbitration end-to-end). Genuinely don’t know which way it goes.

Honest constraint on our side: the 4-node GB10 is running production training. Pulling a GPU for a two-stream matmul experiment is real schedule cost, not a quick afternoon. So no near-term promise on a comparison. If we get a window, or a non-production GB10 frees up, your design reproduces from what you’ve already documented.

Either way, clean negative data is a good place to stop.

Hi Jesse,

Thank you for taking the time to run the additional checks and for the very careful follow-up analysis.

The way you isolated the effect using permutation control, CPU bandwidth comparison, and single-stream baselines was particularly valuable. The combination of your statistical test (z = +6.56) with the controlled null experiments makes the signal much more interpretable than I could have achieved from the original single-machine observations alone.

I also appreciate your conclusion regarding the single-GPU limitation. I agree with your assessment that the observable effect size on a GTX 1070 is small, and that the more meaningful dynamics likely emerge in collective execution paths involving synchronization barriers, multi-stream interaction, or multi-node orchestration layers rather than isolated kernels.

That framing aligns with my current understanding as well: the single-device setup seems to expose only a low-resolution projection of a broader scheduling and memory arbitration behavior.

Regarding GB10 / UMA-scale systems — understood that there are no commitments possible. If there is future availability and you revisit this with existing documentation, that would already be very valuable.

For completeness, I’ve consolidated the minimal reproducible setup we discussed here:

It may be useful as a stable reference point if future testing across architectures becomes feasible.

Thanks again for the rigor and the time you put into this. It significantly helped clarify the boundary conditions of the effect.

Best regards,
yonezaemon1

Thanks for the closing-loop note and for putting the MRE on GitHub. Concrete reproducer beats prose — anyone who comes back to this thread has a stable artifact to start from now.

If a non-production GB10 window opens up, the symmetric-stream test on UMA is exactly the comparison I’d want to run, and the published seed-sweep + repeat-for-stability discipline you used makes it a one-shot setup rather than a methodology argument. The “low-resolution projection of broader scheduling and memory arbitration behavior” framing you landed on reads right. Single-GPU isolates one boundary; the dynamics that matter probably live at the collective synchronization layer.

Good work on the falsification follow-up — the discipline of running the controls and reporting the result is the rare part.