Just FYI, goofing off with some of the lab rats today. Connected two Sparks with a pair of QSFP56 cables, bond mode balance-rr, MTUBytes 9000, assigned static addresses with host file entries. Respectable rsync speeds, mpirun working fine between both units.
Quick hack (using systemd-networkd for the network config):
Files in /etc/systemd/network (the other unit is 10.254.254.2/30):
cat bond0.netdev
[NetDev]
Name=bond0
Kind=bond
MTUBytes=9000
[Bond]
Mode=balance-rr
cat bond0.network
[Match]
Name=bond0
[Network]
Address=10.254.254.1/30
[Link]
MTUBytes=9000
cat enp1s0f0np0.network
[Match]
Name=enp1s0f0np0
[Network]
Bond=bond0
[Link]
MTUBytes=9000
cat enp1s0f1np1.network
[Match]
Name=enp1s0f1np1
[Network]
Bond=bond0
[Link]
MTUBytes=9000
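For anyone copying the files above, here is a rough way to check that the bond and the jumbo MTU actually took effect. This is a sketch, not something from my notes: `jumbo_payload` and `verify_bond` are made-up helper names, and it assumes iputils `ping` and the standard bonding entry under /proc/net/bonding. Nothing runs until you call the function.

```shell
#!/bin/sh
# Hypothetical helpers for checking a bond after systemd-networkd is restarted.

# Largest ICMP payload that fits in one frame:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
jumbo_payload() {
    echo $(( $1 - 28 ))
}

# verify_bond BOND PEER [MTU]
verify_bond() {
    bond=$1
    peer=$2
    mtu=${3:-9000}

    # Kernel's view of the bond: mode and per-slave link state.
    cat "/proc/net/bonding/$bond"

    # MTU that was actually applied to the bond interface.
    ip -d link show "$bond"

    # -M do sets "don't fragment", so the ping only succeeds if
    # full-size frames make it across without being split.
    ping -M do -c 3 -s "$(jumbo_payload "$mtu")" "$peer"
}
```

After `sudo systemctl restart systemd-networkd`, something like `verify_bond bond0 10.254.254.2` should show balance-rr with both slaves up and three successful 8972-byte pings, if everything is wired as described.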
Hi! Can you share the actual numbers? Also, can you share the output of your network tests?
I am asking because I just stacked my two Sparks and ran the `all_gather_perf` benchmark from `nccl-tests`, and I expected to see higher bus bandwidth.
“If you have no expectations, it’s hard to be disappointed!” I’ve been spoiled by much faster speeds, which is why I said WTF when I was moving data between the Sparks and decided to see what could be done.
rsync -ruv testfile.bin ss2:
sending incremental file list
testfile.bin
sent 17,184,063,578 bytes received 35 bytes 404,330,908.54 bytes/sec
total size is 17,179,869,184 speedup is 1.00
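To compare that against link speed, the rsync rate can be converted from bytes/sec to Gbit/s. A minimal sketch; `to_gbps` is a made-up helper name. Keep in mind that rsync to a remote host normally runs over ssh, so encryption and storage often cap the rate well before the link does, and this number is not a measure of what the bond can carry.

```shell
#!/bin/sh
# Hypothetical helper: convert a bytes/sec figure (commas allowed,
# as rsync prints it) to decimal gigabits per second.
to_gbps() {
    echo "$1" | tr -d ',' | awk '{ printf "%.2f\n", $1 * 8 / 1e9 }'
}

to_gbps 404,330,908.54   # prints 3.23
```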
I’ll run the NCCL tests when I’m done with work later today.
I wouldn’t count on these numbers for anything; my space-heater lab rats are so far off the stock build image now that I’d have to wipe them and keep a notebook recording every change, step by step. I’m a happy camper, great machines! :-) LOL, they are going to keep me very warm this winter!