Sorry for the late response. Aggregation will have some limitations, as you described. NCCL should be able to sort things out across multiple links, and so will regular non-RDMA traffic in bonding mode 2 (XOR).
As for the port combination, here’s the common enumeration of CX7 NIC ports within the OS:
-port 0 first half
-port 1 first half
-wifi goes in here
-port 0 second half
-port 1 second half
Aggregation should be built between the halves, of course, so either port 0 (first and second half) or port 1 (first and second half). Using two cables shouldn’t provide any speed improvement.
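To illustrate why a mode 2 (XOR) bond balances across flows rather than speeding up a single flow, here is a minimal Python sketch of the layer2 hash policy as described in the Linux bonding documentation (last byte of source MAC XOR last byte of destination MAC XOR packet type, modulo the number of links). The function name and the MAC addresses are made up for illustration, and the real kernel hash may differ in detail depending on kernel version and `xmit_hash_policy`.

```python
def xor_layer2_link(src_mac: str, dst_mac: str, ethertype: int, n_links: int) -> int:
    """Pick a bond member with a simplified layer2 XOR hash (illustrative only)."""
    src_last = int(src_mac.split(":")[-1], 16)  # last byte of the source MAC
    dst_last = int(dst_mac.split(":")[-1], 16)  # last byte of the destination MAC
    return (src_last ^ dst_last ^ ethertype) % n_links


if __name__ == "__main__":
    # Hypothetical addresses: one sender talking IPv4 (0x0800) to two peers.
    sender = "aa:bb:cc:dd:ee:01"
    for peer in ("aa:bb:cc:dd:ee:02", "aa:bb:cc:dd:ee:03"):
        link = xor_layer2_link(sender, peer, 0x0800, n_links=2)
        print(f"{sender} -> {peer}: link {link}")
```

Under this policy a given pair of endpoints always maps to the same link, so a single stream never exceeds one link's bandwidth; only traffic to different peers can spread across both halves.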
C/P-states:
Now on C-states: GB10 does offer C-states (C0, WFI/C1 (ARM halt), and C3). I see @maiia's figures on latency; this could be due to core affinity to the PCIe device, since Spark’s CPU has two major groups of cores with their own caches (namely a 10+10 arrangement). C2 isn’t available, and wake-up times from C3 are long. NCCL is topology aware and will figure it all out.
C-states can be effectively disabled by setting the governor to performance mode, and you can verify that by probing the state timers for state1 and state3, but P-states will always run. We have about 100 MHz increments and many steps on both P- and E-cores. E-cores have very aggressive stepping, going all the way down to 330 MHz opportunistically to conserve power.
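For probing the state timers, a rough Python sketch follows. It assumes the standard Linux cpuidle sysfs layout (`/sys/devices/system/cpu/cpuN/cpuidle/stateM/` with `name`, `time` in microseconds, and `usage`); which state indices actually appear on GB10 is not something this sketch knows, so it simply reads whatever is present. The function names are my own.

```python
import time
from pathlib import Path


def read_idle_states(cpu: int = 0, root: str = "/sys/devices/system/cpu") -> dict:
    """Return {state_dir: {"name", "time_us", "usage"}} for one CPU's idle states."""
    base = Path(root) / f"cpu{cpu}" / "cpuidle"
    states = {}
    for state_dir in sorted(base.glob("state[0-9]*")):
        states[state_dir.name] = {
            "name": (state_dir / "name").read_text().strip(),
            "time_us": int((state_dir / "time").read_text()),  # total residency
            "usage": int((state_dir / "usage").read_text()),   # number of entries
        }
    return states


def idle_time_delta(before: dict, after: dict) -> dict:
    """Microseconds spent in each idle state between two snapshots."""
    return {k: after[k]["time_us"] - before[k]["time_us"] for k in after if k in before}


if __name__ == "__main__":
    first = read_idle_states(cpu=0)
    time.sleep(1.0)  # sample window; adjust as needed
    second = read_idle_states(cpu=0)
    for state, delta in idle_time_delta(first, second).items():
        print(f"{state} ({second[state]['name']}): +{delta} us")
```

If the deeper states show a delta of zero (or close to it) over the window while the system is otherwise quiet, the cores are not entering them.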