Strange bandwidthTest results with new hardware: lower and asymmetric H->D, D->H

Using CUDA 2.3, driver 190.18, under RHEL 5.3, 64-bit.

Moving from an older CUDA version on an older machine (HP xw8600, older Xeon CPU) to the latest CUDA on a newer machine (HP DL370 G6 server, Core i7-class CPU), we seem to have encountered a strange drop in bandwidthTest --memory=pinned performance.

In the old setup, I was able to see ~5400 MB/s in each direction with a GTX 280.
In the new server with the same card, I get 3000 MB/s H->D, 3200 MB/s D->H.
In the new server with a GTX 295, I get 5000 MB/s H->D, 3200 MB/s D->H.
In the new server connected to a Tesla C1060, I get 5600 MB/s H->D, 3400 MB/s D->H.

Any ideas:
Why is the GTX 280 lower than before?
Why are the bandwidths so asymmetric?
Even stranger, regardless of the card, the D->H number sometimes drops down to 2000 MB/s (if I run the test several times, it will read ~2000 about 40% of the time).

Has anyone seen anything similar? Is this a byproduct of the beta driver?

I would try playing with numactl, forcing affinity of bandwidthTest to specific sockets, and seeing if bandwidth changes. I’m not sure if the DL370 is a dual-X58 chipset box, but if it is, that would certainly explain things.

So I tried:

numactl --physcpubind N ./bandwidthTest --memory=pinned

I definitely got better D->H performance when N was even (3200 MB/s) than when N was odd (2000 MB/s); however, this didn’t affect the H->D bandwidth. So that explains why the results fluctuated: it simply depended on which CPU the test was assigned to. Why should one core have better bandwidth than another? And why does this only seem to affect D->H bandwidth, not H->D?
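(Side note, in case it helps anyone reproduce this: instead of numactl you can also pin the thread from inside the program before allocating the pinned buffer, so the pages should end up on the node the thread is running on. This is only a rough sketch of the idea, not the SDK test; the 64 MB buffer and 50 iterations are arbitrary choices, and it assumes Linux/glibc for sched_setaffinity. Build it with nvcc and pass the core number as the first argument.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int core = (argc > 1) ? atoi(argv[1]) : 0;      /* which core to test (the N above) */

    /* Pin this thread to one core *before* allocating anything, so the
       pinned buffer is first-touched on that core's NUMA node. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    const size_t bytes = 64 << 20;                  /* 64 MB, arbitrary */
    const int    iters = 50;
    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);                  /* pinned host memory */
    cudaMalloc(&d_buf, bytes);

    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);    /* warm up */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; i++)
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("core %d: D->H %.0f MB/s\n", core,
           (double)bytes * iters / (ms / 1000.0) / (1 << 20));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}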

The fact that I’m able to get 5600 MB/s H->D with the Tesla host interface card seems to prove that I’m in an x16 slot. So why is the GTX 280 H->D now stuck down in the 3K’s? And why is the D->H bandwidth for all these cards now seemingly capped in the 3K’s?

The DL370 box has dual quad-core X5570s and uses the Intel 5520 chipset. Judging by http://www.intel.com/products/server/chipsets/, the 5520 may just be the dual-socket server version of the X58 chipset. What do you mean by “that would certainly explain things”?

Thanks again for your time.

D->H is then almost certainly just a symptom of an immature BIOS. There’s not much we can do about it; you should bug HP to look into it.

As far as the fluctuation goes, lots of Nehalem/Tylersburg machines have two chipsets, where each CPU connects to a specific chipset. Each chipset in turn has its own set of PCIe lanes. Since memory is per-socket in Nehalem, you now have an extra hop when you transfer data from a GPU on CPU 0’s PCIe lanes to the memory on CPU 1’s socket. As a result, paying attention to NUMA matters.

I recently wrote a post on the forum about the same observation. We have a server with two Intel Nehalem CPUs and two Tylersburg IOHs, each of which is connected to a C1060 card. The measured bandwidths are about 6 GB/s for H->D and only 3.5 GB/s for D->H. This is when there is “perfect” affinity between memory / CPU and GPU. When we assign the opposite GPU to a CPU (playing with the numactl command), the bandwidths are lower, ~5 GB/s for H->D and 2 GB/s for D->H.
We still cannot explain why the bandwidth for D->H is lower than the one for H->D, but we are fairly sure this comes from the two-Tylersburg configuration. Maybe future BIOS releases will correct that.

Thanks for letting me know about your post, Mat (which I assume is this: http://forums.nvidia.com/index.php?showtopic=102207&hl=).

So it appears the problem only affects Fig. 3 on page 6 of http://edc.intel.com/Download.aspx?id=2401…l=/default.aspx, and not Figs. 1 or 2.

If you don’t mind me asking, what were the specs on both your single-IOH machine (that gave good results) and your dual-IOH machine? And if anybody else gets good bandwidth numbers with a Nehalem chip, I’d love to hear your configuration.

Looking at the chipset errata does not leave me with an easy feeling in my stomach:
http://www.intel.com/assets/PDF/specupdate/321329.pdf

I do have a dual-socket Nehalem machine with a single X58, and its bandwidth is perfectly normal. So yes, it’s a two-chipset problem.

Indeed, I also ran the tests on a single-IOH machine with 2 Nehalem X5570 CPUs. The bandwidths in both directions are good (~6 GB/s). So as tmurray says, the problem comes from the 2-IOH configuration.

Reading your post again, I was wondering: does your machine have 2 Tylersburgs or just one? I also ran tests on a single-Tylersburg machine, but with two sockets, and the measured bandwidths were good. So now I am confused. In your case, does the problem come from the 2 Tylersburgs or the 2 sockets?

I too am getting very unexpected test results with a dual IOH machine.

I’m using 8 GPUs (4 Tesla S1070 units) hooked up to the only dual-IOH board I could find with its PCIe lanes carved up into 4 x16 slots (Tyan S7025), and 2 Xeon 5590 CPUs. If anyone else tries this, note that this board won’t POST with all 4 slots occupied unless it has the latest BIOS update.

I would expect 6 GB/sec per slot of delivered bandwidth (of the 8 GB/sec per-slot peak), even concurrently, since 4 x 6 GB/sec = 24 GB/sec, which still fits under my peak 46 GB/sec of memory bandwidth and 25.2 GB/sec of unidirectional QPI bandwidth (across 2 QPI links). But the highest concurrent figure I’ve seen is 13.5 GB/sec HtoD. DtoH is less than half of that! Keep in mind, that’s across all 4 slots concurrently, using the bandwidth test found in the SDK. The reason I expect 6 GB/sec per slot is that I’ve seen it before on a different board (Asus P6T7) that didn’t even have dedicated lanes like this board apparently does.
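(For anyone who wants to reproduce the concurrent measurement without juggling several copies of the SDK test by hand, a rough multi-threaded sketch along these lines should be roughly equivalent: one host thread per GPU, each with its own pinned buffer, all copies in flight at once. To be clear, it’s only a sketch of the idea, not the test I actually ran; the 64 MB buffer and 50 iterations are arbitrary, and it assumes Linux/pthreads. Build with something like nvcc -o concbw concbw.cu -lpthread.)

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

#define BYTES ((size_t)64 << 20)   /* 64 MB per copy, arbitrary */
#define ITERS 50

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void *worker(void *arg)
{
    int dev = (int)(long)arg;
    cudaSetDevice(dev);            /* one context per host thread */

    void *h, *d;
    cudaMallocHost(&h, BYTES);     /* pinned, so the copy is a straight DMA */
    cudaMalloc(&d, BYTES);

    cudaMemcpy(d, h, BYTES, cudaMemcpyHostToDevice);   /* warm up */

    double t = now();
    for (int i = 0; i < ITERS; i++)
        cudaMemcpy(d, h, BYTES, cudaMemcpyHostToDevice);
    double htod = (double)BYTES * ITERS / (now() - t) / (1 << 20);

    t = now();
    for (int i = 0; i < ITERS; i++)
        cudaMemcpy(h, d, BYTES, cudaMemcpyDeviceToHost);
    double dtoh = (double)BYTES * ITERS / (now() - t) / (1 << 20);

    printf("GPU %d:  HtoD %.0f MB/s   DtoH %.0f MB/s\n", dev, htod, dtoh);

    cudaFree(d);
    cudaFreeHost(h);
    return NULL;
}

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n > 16) n = 16;

    pthread_t tid[16];
    for (int i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    return 0;
}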

Further complicating things, this machine is not behaving like a NUMA system for HtoD transfers, or at least the effect is extremely subtle. For any given CPU, bandwidth to all PCIe slots is about the same. However, one CPU (or IOH, I’m not sure which) is distinctly slower than the other (a 33% difference!). For DtoH, the NUMA effect shines through more clearly, but there is still a socket difference, which is even more observable. Oddly, the NUMA effect is more pronounced on one socket than the other.

For HtoD transfers, one socket gets about 4-4.2 GB/sec independently to each GPU. The other gets 2.8-3.3 GB/sec to each GPU. This is consistent, repeatable performance. Both numbers are abysmal, but both the fact that it varies by socket and the fact that it does not vary by GPU (the expected NUMA effect) are a mystery to me at the moment. If it is varying, it’s not by much.

For DtoH transfers, one socket gets 2.1-2.5 GB/sec to each GPU, with an observable NUMA effect dividing that range. As with HtoD, the other socket behaves differently, but its range is wider and the NUMA effect more pronounced: 1.9-3.1 GB/sec. I don’t have any explanation for this.

For completeness, the bandwidth matrices are attached. Units are MB/sec, and I have HT disabled on the host. If anyone can offer any explanation or theory as to:

    Why I’m not seeing 6 GB/sec per slot independently

    Why the NUMA effect is more pronounced for DtoH

    Why the DtoH NUMA effect is more pronounced on one socket

    Why HtoD is 33% faster than DtoH

I’m interested.

thx-

Jeremy Enos
dtoh.txt (1.01 KB)
htod.txt (1.01 KB)

I am also getting asymmetric bandwidth results for pinned memory: ~5000 MB/s for host->device and ~3000 MB/s for device->host. However, for pageable memory, the bandwidth is quite symmetric. Any ideas why this is so?
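(To be concrete, the comparison I mean is essentially the following; it’s a stripped-down sketch rather than the SDK test, with an arbitrary 64 MB buffer and 20 iterations. The only difference between the two cases is whether the host buffer comes from malloc or cudaMallocHost. My understanding is that pageable copies get staged through an internal driver buffer by the CPU, so both directions end up limited by the same host-side memcpy, which could hide the PCIe asymmetry; I’d be happy to be corrected on that.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cuda_runtime.h>

static double wall(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void measure(const char *label, void *h, void *d, size_t bytes)
{
    const int iters = 20;
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);       /* warm up */

    double t0 = wall();
    for (int i = 0; i < iters; i++)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    double htod = (double)bytes * iters / (wall() - t0) / (1 << 20);

    t0 = wall();
    for (int i = 0; i < iters; i++)
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    double dtoh = (double)bytes * iters / (wall() - t0) / (1 << 20);

    printf("%-8s  H->D %.0f MB/s   D->H %.0f MB/s\n", label, htod, dtoh);
}

int main(void)
{
    const size_t bytes = 64 << 20;
    void *d, *pageable, *pinned;

    cudaMalloc(&d, bytes);
    pageable = malloc(bytes);                 /* ordinary pageable host memory */
    cudaMallocHost(&pinned, bytes);           /* page-locked, directly DMA-able */

    measure("pageable", pageable, d, bytes);
    measure("pinned",   pinned,   d, bytes);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d);
    return 0;
}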

The first HtoD and DtoH numbers were from the SDK bandwidth test. I’ve used a different test (acquired from this forum) to retest. Similar results: closer to the 6 GB/sec that I’ve observed on other boards, but the asymmetric differences between HtoD and DtoH are still there, as is the differing performance from each CPU socket to its near GPUs.

New numbers attached (same effect as before, but a little clearer to see).
dtoh.txt (904 Bytes)
htod.txt (904 Bytes)

Just to refresh the discussion, has anyone come across a reasonable explanation for the really low DtoH bandwidths? The penalty of communicating through two chipsets would be understandable and acceptable if it were about a 10% loss or less.

Does anyone have experience with and/or benchmark results for the Tyan S7025, which is in fact the motherboard NVIDIA suggests for 4-card systems? Does this board also exhibit the abysmal DtoH bandwidth?

I hope there’s some knowledge out there somebody can share…

Cheers,
Szilard

I assume you’re asking for additional benchmarks on the S7025? Otherwise, see above; that’s the board used.

Although I am particularly interested in the performance of the Tyan board, I was asking more for additional information on the issue in general, pretty much what you summarized earlier with your 4 questions, especially the most annoying one: the painfully low DtoH transfer.

As tmurray suggested, this might be an issue that will be (or already is) corrected by BIOS updates, and not a “feature” of dual-Tylersburg architectures. Therefore, it might well be the case that this is an issue only with certain boards from certain manufacturers. I saw that you were using the S7025 for your benchmarks; have there been any BIOS updates since then that solved the issue?

I’d be glad to have some feedback on other boards as well (I guess Supermicro also has 2-3 mainboards with 4 full-speed PCIe slots).

EVGA has two NF200s instead of a dual Tylersburg on their dual-Xeon 270-GT-W555.

tmurray, can you get the accountants to set you up with one of these for testing?

Comparing D->H on the EVGA 270-GT-W555 vs. the Tyan S7025 would help those of us tooling up for eight x Fermi.

thanks!

NF200 is a switch; you have the same number of PCIe lanes as with a single X58 (36), which means you can only run two GPUs at x16.

Thanks tmurray. 4 x GTX 295 on a single X58 now seems almost miraculous: all those switches switching switches.

Given the limitations of the X58 and 5520 chipsets, has anyone been able to compare the performance of 4, 6 or 8 GPUs on different motherboards for bandwidth-limited multi-GPU kernels? It would make an interesting table. :unsure:

That many GPUs will completely saturate the QPI link between the IO hub and the CPU (which I am assured tops out at around 10 GB/s). I recently found that a pair of GTX 275s in x16 slots is enough to be bottlenecked by the HT link on the AMD 790FX board I am using. All the gory details are here, if you are interested. A single X58 or 5500-series IO hub will be a bit better, but not by that much.