Also, better be sure you're not memory-bandwidth bound, depending on which board you plug into. Many of these so-called 3- and 4-way x16 boards can't drive all 16 lanes to every slot at once (an Intel IOH provides only 36 PCI-E lanes), so they route slots through some sort of PCI-E switch instead. You then have to be careful which slots you draw bandwidth from at the same time, balancing use across the PCI-E switches.
Additionally, even if you're only using 32 lanes, that already saturates the uni-directional QPI bandwidth of 12.8 GB/sec (assuming 2 QPI links). So you have up to an 18 GB/sec demand (all 36 lanes) assuming uni-directional transfers, against a 12.8 GB/sec pipe to host memory. In real-world tests I was only able to push 10 GB/sec to host memory in one direction, which is consistent with this. That was on a single-socket X58, though; future multi-socket, multi-IOH chipsets may balance PCI-E slots across them better.
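To make that arithmetic concrete, here's a back-of-the-envelope sketch. The 500 MB/sec-per-lane figure is the usual PCI-E 2.0 per-direction number; the QPI figure is the one quoted above:

```python
# Back-of-the-envelope PCI-E demand vs. QPI capacity (all uni-directional).
# PCI-E 2.0 moves roughly 500 MB/sec per lane per direction.
PCIE2_GBS_PER_LANE = 0.5

ioh_lanes = 36    # total PCI-E lanes on a single Intel IOH
qpi_gbs = 12.8    # uni-directional QPI bandwidth quoted above

pcie_demand = ioh_lanes * PCIE2_GBS_PER_LANE
print(f"Aggregate PCI-E demand:  {pcie_demand:.1f} GB/sec")   # 18.0
print(f"QPI pipe to host memory: {qpi_gbs:.1f} GB/sec")       # 12.8
print(f"Shortfall:               {pcie_demand - qpi_gbs:.1f} GB/sec")
```

Even 32 active lanes (32 × 0.5 = 16 GB/sec) already exceed the 12.8 GB/sec QPI pipe, so the 10 GB/sec measured number isn't surprising.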
Regarding the multiple-S1070 hookup: the bus isn't hard-divided into 8 lanes per GPU; it's switched. So you can get near-16x performance if you're only accessing a single GPU/slot at a time.
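A rough sketch of why switched beats hard-divided lanes (Python; the x16 ≈ 8 GB/sec figure is the usual PCI-E 2.0 per-direction number, and the even split under contention is a simplifying assumption):

```python
# Behind a PCI-E switch, the upstream x16 link is shared dynamically:
# one active GPU can burst at nearly the full x16 rate, while concurrent
# GPUs split it. A hard-divided x8-per-GPU layout caps every GPU at x8
# even when the others are idle. (PCI-E 2.0 x16 ~ 8 GB/sec per direction.)
UPSTREAM_X16_GBS = 8.0

def per_gpu_bandwidth(active_gpus: int) -> float:
    """Approximate per-GPU host bandwidth behind one shared switch."""
    return UPSTREAM_X16_GBS / max(active_gpus, 1)

print(per_gpu_bandwidth(1))  # 8.0 -> near-16x for a lone GPU
print(per_gpu_bandwidth(2))  # 4.0 -> effectively x8 each under contention
```

The win is only for staggered access patterns; if all GPUs transfer at once, a switch behaves no better than the hard split.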
This is the board I hooked 8 S1070s to, btw:
Caveat: this board, though dubbed a "supercomputer" board, is still a desktop board and has the stupid BIOS requirement that a video card be present. Since it's PCI-E only, that means one of the four x16 slots must be dropped to 8x to share with the video card.