Has there been any progress on this?
I can get 2x 4090s working by setting NCCL_P2P_DISABLE=1, but moving to 3x 4090s barely improves throughput over 2x.
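
For anyone trying to reproduce this, here is a minimal sketch of one way the workaround could be applied when launching a job from Python. The helper name `nccl_env` and the script name `train.py` are hypothetical and not from any library. `NCCL_DEBUG=INFO` is optional and only there so the NCCL log shows which transport gets picked between GPUs, which may help explain the weak scaling at 3 GPUs.

```python
import os


def nccl_env(disable_p2p=True, debug=False, base=None):
    """Return a copy of the environment with NCCL settings applied.

    `nccl_env` is a hypothetical helper for illustration only.
    """
    # Copy so the caller's environment is not modified in place.
    env = dict(os.environ if base is None else base)
    if disable_p2p:
        # The workaround from this thread: turn off GPU peer-to-peer transport.
        env["NCCL_P2P_DISABLE"] = "1"
    if debug:
        # Ask NCCL to log its initialization and transport choices.
        env["NCCL_DEBUG"] = "INFO"
    return env


# Example launch (assumes torchrun is installed; train.py is a placeholder):
# import subprocess
# subprocess.run(["torchrun", "--nproc_per_node=3", "train.py"],
#                env=nccl_env(debug=True), check=True)
```

The variables have to be set before the processes initialize NCCL, which is why the sketch passes them through the launcher's environment rather than setting them inside the training script.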