Inter-GPU Latency on B200 Higher Than on Hopper

Hi everyone,

I have been tuning all_reduce latency with nccl-tests on a server with 8× B200 GPUs. With an 8-byte payload, the best latency I can achieve is about 3 µs.

However, with exactly the same tuning applied to servers with 8× H800 or 8× H200 GPUs, the minimum latency for the same 8-byte test is about 2.3 µs.
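
To put the gap in numbers, here is a quick sketch using the approximate figures quoted above (about 3 µs on B200, about 2.3 µs on H800/H200). The variable names are only for illustration, and since the inputs are rounded, the result is a rough estimate rather than a precise measurement.

```python
# Approximate best-case 8-byte all_reduce latencies quoted in this post (µs).
b200_latency_us = 3.0      # 8x B200
hopper_latency_us = 2.3    # 8x H800 / 8x H200

# Absolute and relative difference, with Hopper as the baseline.
gap_us = b200_latency_us - hopper_latency_us
rel_pct = gap_us / hopper_latency_us * 100

print(f"B200 is {gap_us:.1f} µs slower, roughly {rel_pct:.0f}% above Hopper")
```

So the B200 result is roughly 30% higher than the Hopper one for this test.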

Is Blackwell expected to show worse small-message latency than Hopper?
This feels a bit counter-intuitive.

Thanks!