Strange bandwidthTest results with new hardware: lower and asymmetric H->D, D->H

Using CUDA 2.3, driver 190.18, under RHEL 5.3, 64-bit.

Moving from an older CUDA version on an older machine (HP xw8600, older Xeon CPU) to the latest CUDA on a newer machine (HP DL370 G6 server, Core i7-class CPU), we seem to have encountered a strange drop in bandwidthTest --memory=pinned performance.

In the old setup, I was able to see ~5400 MB/s in each direction with a GTX 280.
In the new server with the same card, I get 3000 MB/s H->D, 3200 MB/s D->H.
In the new server with a GTX 295, I get 5000 MB/s H->D, 3200 MB/s D->H.
In the new server connected to a Tesla C1060, I get 5600 MB/s H->D, 3400 MB/s D->H.

Any ideas:
Why is the GTX 280 lower than before?
Why are the bandwidths so asymmetric?
Even stranger, regardless of the card, the D->H number sometimes drops down to 2000 MB/s (if I run the test several times, it will read ~2000 about 40% of the time).

Has anyone seen anything similar? Is this a byproduct of the beta driver?

I would try playing with numactl, forcing affinity of bandwidthTest to specific sockets, and seeing if bandwidth changes. I’m not sure if the DL370 is a dual-X58 chipset box, but if it is, that would certainly explain things.

So I tried:

numactl --physcpubind N ./bandwidthTest --memory=pinned

I definitely got better D->H performance when N was even (3200 MB/s) than when N was odd (2000 MB/s); however, this didn’t affect the H->D bandwidth. So that explains why the results fluctuated: it simply depended on which CPU the test was assigned to. Why should one core have better bandwidth than another? And why does this only seem to affect D->H bandwidth, not H->D?
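(Side note, in case it helps anyone reproduce this: instead of numactl you can also pin the thread from inside the program before allocating the pinned buffer, so the pages should end up on the node the thread is running on. This is only a rough sketch of the idea, not the SDK test; the 64 MB buffer and 50 iterations are arbitrary choices, and it assumes Linux/glibc for sched_setaffinity. Build it with nvcc and pass the core number as the first argument.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int core = (argc > 1) ? atoi(argv[1]) : 0;      /* which core to test (the N above) */

    /* Pin this thread to one core *before* allocating anything, so the
       pinned buffer is first-touched on that core's NUMA node. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    const size_t bytes = 64 << 20;                  /* 64 MB, arbitrary */
    const int    iters = 50;
    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);                  /* pinned host memory */
    cudaMalloc(&d_buf, bytes);

    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);    /* warm up */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; i++)
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("core %d: D->H %.0f MB/s\n", core,
           (double)bytes * iters / (ms / 1000.0) / (1 << 20));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}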

The fact that I’m able to get 5600 MB/s H->D with the Tesla host interface card seems to prove that I’m in an x16 slot. So why is the GTX 280 H->D now stuck down in the 3K’s? And why is the D->H bandwidth for all these cards now seemingly capped in the 3K’s?

The DL370 box has dual quad-core X5570s and uses the Intel 5520 chipset. Judging by http://www.intel.com/products/server/chipsets/, the 5520 may just be the dual-socket server version of the X58 chipset. What do you mean by “that would certainly explain things”?

Thanks again for your time.

D->H is then almost certainly just a symptom of an immature BIOS. There’s not much we can do about it; you should bug HP to look into it.

As far as the fluctuation goes, lots of Nehalem/Tylersburg machines have two chipsets, where each CPU connects to a specific chipset. Each chipset in turn has its own set of PCIe lanes. Since memory is per-socket in Nehalem, you now have an extra hop when you transfer data from a GPU on CPU 0’s PCIe lanes to the memory on CPU 1’s socket. As a result, paying attention to NUMA matters.

I recently wrote a post on the forum about the same observation. We have a server with two Intel Nehalem CPUs and two Tylersburg IOHs, each of which is connected to a C1060 card. The measured bandwidths are about 6 GB/s for H->D and only 3.5 GB/s for D->H. This is when there is “perfect” affinity between memory / CPU and GPU. When we assign the opposite GPU to a CPU (playing with the numactl command), the bandwidths are lower, ~5 GB/s for H->D and 2 GB/s for D->H.
We still cannot explain why the bandwidth for D->H is lower than the one for H->D, but we are fairly sure this comes from the two-Tylersburg configuration. Maybe future BIOS releases will correct that.

Thanks for letting me know about your post, Mat (which I assume is this: http://forums.nvidia.com/index.php?showtopic=102207&hl=).

So it appears the problem only affects Fig. 3 on page 6 of http://edc.intel.com/Download.aspx?id=2401…l=/default.aspx, and not Figs. 1 or 2.

If you don’t mind me asking, what were the specs on both your single-IOH machine (that gave good results) and your dual-IOH machine? And if anybody else gets good bandwidth numbers with a Nehalem chip, I’d love to hear your configuration.

Looking at the chipset errata does not leave me with an easy feeling in my stomach:
http://www.intel.com/assets/PDF/specupdate/321329.pdf

I do have a dual-socket Nehalem machine with a single X58, and its bandwidth is perfectly normal. So yes, it’s a two-chipset problem.

Indeed, I also ran the tests on a single-IOH machine with 2 Nehalem X5570 CPUs. The bandwidths in both directions are good (~6 GB/s). So as tmurray says, the problem comes from the 2-IOH configuration.

Reading your post again, I was wondering: does your machine have 2 Tylersburgs or just one? I also ran tests on a single-Tylersburg machine, but with two sockets, and the measured bandwidths were good. So now I am confused. In your case, does the problem come from the 2 Tylersburgs or the 2 sockets?

I too am getting very unexpected test results with a dual IOH machine.

I’m using 8 GPUs (4 Tesla S1070 units) hooked up to the only dual-IOH board I could find with its PCIe lanes carved up into 4 x16 slots (Tyan S7025), and 2 Xeon 5590 CPUs. If anyone else tries this, note that this board won’t POST with all 4 slots occupied unless it has the latest BIOS update.

I would expect 6 GB/sec per slot of delivered bandwidth (of the 8 GB/sec per-slot peak), even concurrently, since 4 x 6 GB/sec = 24 GB/sec, which still fits under my peak 46 GB/sec of memory bandwidth and 25.2 GB/sec of unidirectional QPI bandwidth (across 2 QPI links). But the highest concurrent figure I’ve seen is 13.5 GB/sec HtoD. DtoH is less than half of that! Keep in mind, that’s across all 4 slots concurrently, using the bandwidth test found in the SDK. The reason I expect 6 GB/sec per slot is that I’ve seen it before on a different board (Asus P6T7) that didn’t even have dedicated lanes like this board apparently does.
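(For anyone who wants to reproduce the concurrent measurement without juggling several copies of the SDK test by hand, a rough multi-threaded sketch along these lines should be roughly equivalent: one host thread per GPU, each with its own pinned buffer, all copies in flight at once. To be clear, it’s only a sketch of the idea, not the test I actually ran; the 64 MB buffer and 50 iterations are arbitrary, and it assumes Linux/pthreads. Build with something like nvcc -o concbw concbw.cu -lpthread.)

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

#define BYTES ((size_t)64 << 20)   /* 64 MB per copy, arbitrary */
#define ITERS 50

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void *worker(void *arg)
{
    int dev = (int)(long)arg;
    cudaSetDevice(dev);            /* one context per host thread */

    void *h, *d;
    cudaMallocHost(&h, BYTES);     /* pinned, so the copy is a straight DMA */
    cudaMalloc(&d, BYTES);

    cudaMemcpy(d, h, BYTES, cudaMemcpyHostToDevice);   /* warm up */

    double t = now();
    for (int i = 0; i < ITERS; i++)
        cudaMemcpy(d, h, BYTES, cudaMemcpyHostToDevice);
    double htod = (double)BYTES * ITERS / (now() - t) / (1 << 20);

    t = now();
    for (int i = 0; i < ITERS; i++)
        cudaMemcpy(h, d, BYTES, cudaMemcpyDeviceToHost);
    double dtoh = (double)BYTES * ITERS / (now() - t) / (1 << 20);

    printf("GPU %d:  HtoD %.0f MB/s   DtoH %.0f MB/s\n", dev, htod, dtoh);

    cudaFree(d);
    cudaFreeHost(h);
    return NULL;
}

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n > 16) n = 16;

    pthread_t tid[16];
    for (int i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    return 0;
}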

Further complicating things, this machine is not behaving like a NUMA system for HtoD transfers, or at least the effect is extremely subtle. For any given CPU, bandwidth to all PCIe slots is about the same. However, one CPU (or IOH, I’m not sure which) is distinctly slower than the other (a 33% difference!). For DtoH, the NUMA effect shines through more clearly, but there is still a socket difference, which is even more observable. Oddly, the NUMA effect is more pronounced on one socket than the other.

For HtoD transfers, one socket gets about 4-4.2 GB/sec independently to each GPU. The other gets 2.8-3.3 GB/sec to each GPU. This is consistent, repeatable performance. Both numbers are abysmal, but both the fact that it varies by socket and the fact that it does not vary by GPU (the expected NUMA effect) are a mystery to me at the moment. If it is varying, it’s not by much.

For DtoH transfers, one socket gets 2.1-2.5 GB/sec to each GPU, with an observable NUMA effect dividing that range. As with HtoD, the other socket behaves differently, but its range is wider and the NUMA effect more pronounced: 1.9-3.1 GB/sec. I don’t have any explanation for this.

For completeness, the bandwidth matrices are attached. Units are MB/sec, and I have HT disabled on the host. If anyone can offer any explanation or theory as to:

    Why I’m not seeing 6 GB/sec per slot independently

    Why the NUMA effect is more pronounced for DtoH

    Why the DtoH NUMA effect is more pronounced on one socket

    Why HtoD is 33% faster than DtoH

I’m interested.

thx-

Jeremy Enos
dtoh.txt (1.01 KB)
htod.txt (1.01 KB)

I am also getting asymmetric bandwidth results for pinned memory: ~5000 MB/s for host->device and ~3000 MB/s for device->host. However, for pageable memory, the bandwidth is quite symmetric. Any ideas why this is so?
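(To be concrete, the comparison I mean is essentially the following; it’s a stripped-down sketch rather than the SDK test, with an arbitrary 64 MB buffer and 20 iterations. The only difference between the two cases is whether the host buffer comes from malloc or cudaMallocHost. My understanding is that pageable copies get staged through an internal driver buffer by the CPU, so both directions end up limited by the same host-side memcpy, which could hide the PCIe asymmetry; I’d be happy to be corrected on that.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cuda_runtime.h>

static double wall(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void measure(const char *label, void *h, void *d, size_t bytes)
{
    const int iters = 20;
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);       /* warm up */

    double t0 = wall();
    for (int i = 0; i < iters; i++)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    double htod = (double)bytes * iters / (wall() - t0) / (1 << 20);

    t0 = wall();
    for (int i = 0; i < iters; i++)
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    double dtoh = (double)bytes * iters / (wall() - t0) / (1 << 20);

    printf("%-8s  H->D %.0f MB/s   D->H %.0f MB/s\n", label, htod, dtoh);
}

int main(void)
{
    const size_t bytes = 64 << 20;
    void *d, *pageable, *pinned;

    cudaMalloc(&d, bytes);
    pageable = malloc(bytes);                 /* ordinary pageable host memory */
    cudaMallocHost(&pinned, bytes);           /* page-locked, directly DMA-able */

    measure("pageable", pageable, d, bytes);
    measure("pinned",   pinned,   d, bytes);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d);
    return 0;
}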

The first HtoD and DtoH numbers were from the SDK bandwidth test. I’ve used a different test (acquired from this forum) to retest. Similar results: closer to the 6 GB/sec that I’ve observed on other boards, but the asymmetric differences between HtoD and DtoH are still there, as is the differing performance from each CPU socket to its near GPUs.

New numbers attached (same effect as before, but a little clearer to see).
dtoh.txt (904 Bytes)
htod.txt (904 Bytes)

Just to refresh the discussion, has anyone come across a reasonable explanation for the really low DtoH bandwidths? The penalty of communicating through two chipsets would be understandable and acceptable if it were about a 10% loss or less.

Does anyone have experience with and/or benchmark results for the Tyan S7025, which is in fact the motherboard NVIDIA suggests for 4-card systems? Does this board also exhibit the abysmal DtoH bandwidth?

I hope there’s some knowledge out there somebody can share…

Cheers,
Szilard

I assume you’re asking for additional benchmarks on the S7025? Otherwise, see above; that’s the board used.

Although I am particularly interested in the performance of the Tyan board, I was asking more for additional information on the issue in general, pretty much what you summarized earlier with your 4 questions, especially the most annoying one: the painfully low DtoH transfer.

As tmurray suggested, this might be an issue that will be (or already is) corrected by BIOS updates, and not a “feature” of dual-Tylersburg architectures. Therefore, it might well be the case that this is an issue only with certain boards from certain manufacturers. I saw that you were using the S7025 for your benchmarks; have there been any BIOS updates since then that solved the issue?

I’d be glad to have some feedback on other boards as well (I guess Supermicro also has 2-3 mainboards with 4 full-speed PCIe slots).

EVGA has two NF200s instead of a dual Tylersburg on their dual-Xeon 270-GT-W555.

tmurray, can you get the accountants to set you up with one of these for testing?

Comparing D->H on the EVGA 270-GT-W555 vs. the Tyan S7025 would help those of us tooling up for eight x Fermi.

thanks!

NF200 is a switch; you have the same number of PCIe lanes as with a single X58 (36), which means you can only run two GPUs at x16.

Thanks tmurray. 4 x GTX 295 on a single X58 now seems almost miraculous: all those switches switching switches.

Given the limitations of the X58 and 5520 chipsets, has anyone been able to compare the performance of 4, 6 or 8 GPUs on different motherboards for bandwidth-limited multi-GPU kernels? It would make an interesting table. :unsure:

That many GPUs will completely saturate the QPI link between the IO hub and the CPU (which I am assured tops out at around 10 GB/s). I recently found that a pair of GTX 275s in x16 slots is enough to be bottlenecked by the HT link on the AMD 790FX board I am using. All the gory details are here, if you are interested. A single X58 or 5500-series IO hub will be a bit better, but not by that much.