AI Fabric Resiliency and Why Network Convergence Matters

jwitsoe · April 11, 2025, 6:26pm

Originally published at: AI Fabric Resiliency and Why Network Convergence Matters | NVIDIA Technical Blog

High-performance computing and deep learning workloads are extremely sensitive to latency. Packet loss forces retransmission or stalls in the communication pipeline, which directly increases latency and disrupts the synchronization between GPUs. This can degrade the performance of collective operations such as all-reduce or broadcast, where every GPU’s participation is required before progressing. The focus of…

