Inter-GPU Latency on B200 Higher Than on Hopper

sliu94 · November 24, 2025, 10:06am

Hi everyone,

I tuned all_reduce latency with nccl-tests on a server with 8× B200 GPUs. With an 8-byte payload, the best latency I can achieve is about 3 µs.

However, with exactly the same tuning methods on servers with 8× H800 or 8× H200 GPUs, the minimum latency for the same 8-byte test is about 2.3 µs.

Is Blackwell expected to show worse small-message latency than Hopper?
This feels a bit counter-intuitive.
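
To put the two figures side by side, here is a small Python sketch of the arithmetic. The values are only the approximate best-case numbers quoted above (about 3 µs and about 2.3 µs), not new measurements, and the variable names are just for illustration.

```python
# Approximate best-case 8-byte all_reduce latencies quoted in this post,
# in microseconds. These are rounded figures, not fresh measurements.
b200_us = 3.0     # 8x B200
hopper_us = 2.3   # 8x H800 or 8x H200

gap_us = b200_us - hopper_us   # absolute difference
ratio = b200_us / hopper_us    # relative difference

print(f"absolute gap: {gap_us:.1f} us")
print(f"B200 / Hopper: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% higher)")
```

So the gap is roughly 0.7 µs, or about 30% higher latency on B200 for this message size.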

Thanks!
