I too am getting very unexpected test results with a dual IOH machine.
I’m using 8 GPUs (4 Tesla S1070 units) hooked up to the only dual-IOH board I could find with its PCIe lanes carved up into 4 x16 slots (Tyan S7025), and 2 Xeon 5590 CPUs. If anyone else tries this, note that this board won’t POST with all 4 slots occupied unless it has the latest BIOS update.
I would expect 6 GB/sec per slot delivered bandwidth (of the 8 GB/sec per-slot peak), even concurrently, since I’ve got a peak 46 GB/sec of memory bandwidth and 25.2 GB/sec of uni-directional QPI bandwidth (across 2 QPI links). But the highest concurrent aggregate I’ve seen is 13.5 GB/sec HtoD, and DtoH is less than half of that! Keep in mind, that’s across all 4 slots concurrently, using the bandwidthTest from the SDK. The reason I expect 6 GB/sec per slot is that I’ve seen it before on a different board (Asus P6T7) that didn’t even have dedicated lanes the way this board apparently does.
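For what it’s worth, here’s the arithmetic behind that expectation as a quick sketch (the 6 GB/sec per-slot figure is the delivered number I measured on the P6T7, not a spec value):

```python
# Rough bandwidth budget for 4 concurrent x16 transfers on this box.
# All figures are from the numbers above; per-slot delivered bandwidth
# is an assumption based on what a single slot achieves on other boards.
slots = 4
per_slot_peak = 8.0        # GB/sec, per-slot PCIe peak
per_slot_delivered = 6.0   # GB/sec, observed on an Asus P6T7
qpi_unidir = 25.2          # GB/sec, uni-directional, across 2 QPI links
mem_peak = 46.0            # GB/sec, aggregate host memory bandwidth

expected_aggregate = slots * per_slot_delivered   # 24.0 GB/sec
print("expected aggregate:", expected_aggregate)
print("fits in QPI budget:", expected_aggregate <= qpi_unidir)
print("fits in memory budget:", expected_aggregate <= mem_peak)

observed = 13.5            # GB/sec, best concurrent HtoD actually measured
print("shortfall vs expected: %.0f%%"
      % (100 * (1 - observed / expected_aggregate)))
```

So the expected 24 GB/sec aggregate should (barely) fit inside the QPI budget, and easily inside the memory budget, yet the measured 13.5 GB/sec falls roughly 44% short.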
Further complicating things, this machine is not behaving like a NUMA system for HtoD transfers, or at least the effect is extremely subtle. For any given CPU, bandwidth to all PCIe slots is about the same. However, one CPU (or IOH, I’m not sure which) is distinctly slower than the other (a 33% difference!!). For DtoH, the NUMA effect shines through more clearly, but there is still a socket-to-socket difference that is even more noticeable. Oddly, NUMA is more pronounced on one socket than the other.
For HtoD transfers, one socket gets about 4-4.2 GB/sec independently to each GPU. The other gets 2.8-3.3 GB/sec to each GPU. This is consistent, repeatable performance. Both numbers are abysmal, but both the fact that it varies by socket and the fact that it does not vary by GPU (the expected NUMA effect) are a mystery to me at the moment. If it is varying, it’s not by much.
For DtoH transfers, one socket gets 2.1-2.5 GB/sec to each GPU, with an observable NUMA effect dividing that range. As with HtoD, the other socket behaves differently, but its range is wider and the NUMA effect more pronounced: 1.9-3.1 GB/sec. I don’t have any explanation for this.
For completeness, the bandwidth matrices are attached. Units are MB/sec, and I have HT disabled on the host. If anyone can offer any explanation or theory for the following, I’d appreciate it:
Why I’m not seeing 6 GB/sec per slot, even independently
Why the NUMA effect is more pronounced for DtoH
Why the DtoH NUMA effect is more pronounced on one socket
Why HtoD is ~33% faster than DtoH
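In case it helps anyone eyeball the attachments, here’s a quick Python sketch for summarizing a matrix like those. It assumes one row per CPU socket and one whitespace-separated MB/sec value per GPU, which may not match the actual file layout, and the sample numbers are made-up values in the ranges quoted above, not the real data:

```python
# Per-socket summary of a bandwidth matrix like the attached
# dtoh.txt / htod.txt. ASSUMED layout (may not match the real files):
# one row per CPU socket, one whitespace-separated MB/sec value per GPU.
def summarize(text):
    rows = [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    for socket, row in enumerate(rows):
        # Spread across GPUs on one socket ~ the NUMA effect for that socket.
        spread = 100.0 * (max(row) - min(row)) / min(row)
        print("socket %d: min %.0f  max %.0f  spread %.0f%%"
              % (socket, min(row), max(row), spread))

# Illustrative values in the ranges from this post, NOT the real data:
sample = """4100 4150 4050 4200 4000 4100 4150 4050
2900 3100 2800 3300 2850 3050 3200 2950"""
summarize(sample)
```

On the made-up sample this reports a small spread on the first socket and a much larger one on the second, which is the kind of asymmetry described above.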
dtoh.txt (1.01 KB)
htod.txt (1.01 KB)