Hi everyone,
I have been tuning all_reduce latency in nccl-tests on a server with 8× B200 GPUs. With an 8-byte payload, the best latency I can achieve is around 3 µs.
However, with exactly the same tuning methods on servers with 8× H800 or 8× H200 GPUs, the minimum latency for the same 8-byte test is about 2.3 µs.
Is Blackwell expected to show worse small-message latency than Hopper?
This feels a bit counter-intuitive.
Thanks!

