How is NCCL supposed to behave when a NaN value is provided to allreduce on one or more of the ranks?
I would imagine some evidence of NaN in the result.
That is reasonable to expect: NaN op number = NaN. But it would not crash, hang, or do other nasty stuff?
I’m not aware of any such expectation.
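
For illustration, the "NaN op number = NaN" expectation can be sketched with a small pure-Python simulation of a sum allreduce. This is not NCCL code and makes no claim about NCCL internals. It only assumes ordinary IEEE 754 floating-point addition, and the function name `simulated_allreduce_sum` is hypothetical.

```python
import math

def simulated_allreduce_sum(per_rank_buffers):
    """Hypothetical stand-in for a sum allreduce: element-wise sum
    across ranks, with every rank receiving the same reduced buffer."""
    length = len(per_rank_buffers[0])
    reduced = [0.0] * length
    for buf in per_rank_buffers:
        for i, value in enumerate(buf):
            # IEEE 754 addition: once a NaN enters, the sum stays NaN.
            reduced[i] += value
    # Each rank gets its own copy of the reduced result.
    return [list(reduced) for _ in per_rank_buffers]

# Three simulated ranks; only rank 1 contributes a NaN, at index 1.
rank_inputs = [
    [1.0, 2.0, 3.0],
    [4.0, float("nan"), 6.0],
    [7.0, 8.0, 9.0],
]
results = simulated_allreduce_sum(rank_inputs)

for rank, buf in enumerate(results):
    nan_positions = [i for i, v in enumerate(buf) if math.isnan(v)]
    print(f"rank {rank}: NaN at indices {nan_positions}")
```

In this sketch the NaN shows up on every rank, but only in the element position where it was supplied. The other elements reduce normally. Other reduction operators such as min or max are not covered here.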