Hi NVIDIA HPC team,
I’m sharing an observation from distributed training logs that may be relevant to NCCL-based multi-GPU systems.
Problem
In large-scale distributed training, we occasionally observe silent stalls where:
- GPU utilization remains at ~100%
- No immediate hardware failure is reported
- Training progress stops due to NCCL synchronization issues or straggler effects
These cases are typically detected only after a timeout or manual inspection.
Observation
From analysis of historical failure logs (GTX 1070-based experiments), we observed a consistent pre-collapse pattern:
- A structural deviation signal begins increasing approximately 30 steps before the full stall
- Standard telemetry (GPU utilization, memory usage, temperature) remains stable during this period
- The failure manifests later as a full synchronization breakdown
This suggests that the failure is not abrupt, but preceded by a measurable state-space deformation.
Approach (Conceptual)
We compute a deterministic deviation score by comparing runtime telemetry against a baseline distribution of normal execution behavior.
Key properties:
- No machine learning model
- No training phase
- Purely deterministic transformation of telemetry streams
- Designed as a passive diagnostic layer (no runtime intervention)
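To make the idea concrete, here is a minimal sketch of one way such a score could be computed. It is an illustration only, not our actual transformation: it assumes the baseline is summarized as a per-metric mean and standard deviation, and it uses the root-mean-square of per-metric z-scores as the deviation score. The metric names (`step_time_ms`, `allreduce_ms`) and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_baseline(samples):
    """Summarize normal-run telemetry as {metric: (mean, std)}.

    `samples` is a non-empty list of dicts mapping a metric name to a float.
    This is a fixed statistical summary, not a learned model.
    """
    baseline = {}
    for metric in samples[0]:
        values = [s[metric] for s in samples]
        baseline[metric] = (mean(values), pstdev(values))
    return baseline


def deviation_score(sample, baseline, eps=1e-9):
    """Root-mean-square of per-metric z-scores against the baseline.

    Returns 0.0 when the sample sits exactly on the baseline means and grows
    as the sample drifts away from them. `eps` guards against a zero std.
    """
    total = 0.0
    for metric, (mu, sigma) in baseline.items():
        z = (sample[metric] - mu) / max(sigma, eps)
        total += z * z
    return math.sqrt(total / len(baseline))


# Hypothetical usage: per-step timings collected from known-good runs.
normal_runs = [
    {"step_time_ms": 100.0, "allreduce_ms": 10.0},
    {"step_time_ms": 102.0, "allreduce_ms": 11.0},
    {"step_time_ms": 98.0, "allreduce_ms": 9.0},
    {"step_time_ms": 100.0, "allreduce_ms": 10.0},
]
baseline = fit_baseline(normal_runs)
score = deviation_score({"step_time_ms": 101.0, "allreduce_ms": 14.0}, baseline)
```

The same input always produces the same score, and the computation only reads telemetry, so it can sit alongside a job without touching the NCCL communicators.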
Question
Has anyone observed similar pre-collapse structural signals in NCCL-based distributed training systems?
Specifically:
- Are silent stalls known to have measurable precursors at the runtime layer (not just hardware metrics)?
- Is there any existing work in NCCL / CUDA runtime that models failure as gradual state drift rather than discrete events?
Any feedback on whether this kind of signal aligns with known failure modes in large-scale GPU clusters would be appreciated.
Thanks in advance for any insights.