Standard NVIDIA CUDA tests fail on a dual RTX 4090 Linux box

Has there been any progress on this?

I can get 2x 4090s working by setting NCCL_P2P_DISABLE=1, but moving to 3x 4090s barely improves throughput over 2x.
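
For anyone wanting to try the same workaround, here is a minimal sketch of one way to apply it from Python. It only sets the NCCL_P2P_DISABLE environment variable for a child process; the helper names `nccl_env` and `run_with_p2p_disabled` are hypothetical and not part of NCCL or any framework, and the command you launch is up to you.

```python
import os
import subprocess
import sys


def nccl_env(disable_p2p=True, base=None):
    """Return a copy of the environment with NCCL_P2P_DISABLE set or cleared.

    `base` defaults to the current process environment (hypothetical helper).
    """
    env = dict(os.environ if base is None else base)
    if disable_p2p:
        # NCCL_P2P_DISABLE=1 turns off NCCL's peer-to-peer transport
        # (direct GPU-to-GPU access over NVLink or PCIe).
        env["NCCL_P2P_DISABLE"] = "1"
    else:
        env.pop("NCCL_P2P_DISABLE", None)
    return env


def run_with_p2p_disabled(cmd):
    """Run `cmd` (a list of arguments) with P2P disabled; return its exit code."""
    return subprocess.run(cmd, env=nccl_env(True), check=False).returncode


if __name__ == "__main__":
    # Example: python this_script.py python train.py
    # (train.py is a placeholder for whatever multi-GPU job is being launched)
    if len(sys.argv) > 1:
        sys.exit(run_with_p2p_disabled(sys.argv[1:]))
```

The variable has to be in the environment before NCCL initializes, which is why the sketch sets it on the child process rather than partway through a running job.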