GTX480 performance differs on AMD and Intel motherboards

Hi all,

We have a puzzling observation related to performance differences for GTX480 on 2 systems:

  1. Intel system: MBD-X8DTi-F, Quad Core Intel Xeon 5500
  2. AMD system: Megatrends H8DAi, Quad-Core AMD Opteron 2376

Other than motherboard/CPU, the two systems are identical, as follows:
Running Fedora 10 OS, same kernel version.
Same video driver is installed (195.36.15)
Both systems are tested with the same physical GTX480 card (a second card was also tested, with the same results).
Both motherboards use PCI-E x16 2.0 interface.
All tested applications are compiled with CUDA toolkit 3.0 on one computer, so testing is carried out with the same executable on both systems.

Tested applications (C, CUDA):

  1. Complex applications with input/output data transfer, multiple kernels
  2. One long kernel is timed, with an execution time of a few seconds. No host-device data transfer (a minimal timing sketch follows below).
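
To make the timing method concrete, here is a minimal sketch of how a single long-running kernel can be timed with CUDA events; the kernel, sizes and launch configuration below are placeholders, not our actual code:

// Minimal sketch: time one long kernel with CUDA events; no host-device
// traffic inside the timed region. Kernel and sizes are placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void longKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)          // artificial work, device memory only
            data[i] = data[i] * 1.000001f + 0.5f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    longKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}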

Results:
On all applications, the AMD-based system performs faster by 8-9%.
Changing the driver to 256.25 improves performance on both systems, leaving the difference between the systems unchanged.

We expect CUDA performance to be approximately equal on different systems, so the observed differences are very strange. Note that in one of our applications there is no device-host data transfer, so we assume CPU memory should not have any effect.

If anyone has similar observations, or any idea about why this happens, please respond…

Thanks a lot

Could you run the SDK app “bandwidthTest --memory=pinned” on the Tylersburg motherboard and post the output here?

Here are the results:

------------------- AMD system ----------------------

Running on…

Device 0: GeForce GTX 480
Quick Mode

Host to Device Bandwidth, 1 Device(s), Pinned memory, Write-Combined Memory Enabled
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3342.7

Device to Host Bandwidth, 1 Device(s), Pinned memory, Write-Combined Memory Enabled
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3342.8

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 116457.8

[bandwidthTest] - Test results:
PASSED

------------------- Intel system ---------------------

/root/bandwidthTest Starting…

Running on…

Device 0: GeForce GTX 480
Quick Mode

Host to Device Bandwidth, 1 Device(s), Pinned memory, Write-Combined Memory Enabled
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5729.0

Device to Host Bandwidth, 1 Device(s), Pinned memory, Write-Combined Memory Enabled
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6305.3

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 114713.1

[bandwidthTest] - Test results:
PASSED

Are you sure you have those labeled correctly? On the results labeled “AMD”, you are either only getting 8 PCIe lanes or are limited to PCIe x16 1.0 throughput. If this were the slower system, then I would say that the decreased PCIe bandwidth was the cause.

This is not a good example to demonstrate the benefits of GDDR5 over GDDR3.

For instance in my GDDR3 GTX-275 I have these results:

Running on…

Device 0: GeForce GTX 275
Quick Mode

Host to Device Bandwidth, 1 Device(s), Pinned memory, Write-Combined Memory Enabled
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5806.5

Device to Host Bandwidth, 1 Device(s), Pinned memory, Write-Combined Memory Enabled
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5475.4

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 104328.0

[bandwidthTest] - Test results:
PASSED

Press to Quit…


What I would like to see is the Matrix Multiplication example from the SDK on the 480. Is it possible for you to run this test?

I get 81 GFlop/s with my GTX-275; how much do you get?

Matrix multiplication is a nice example for testing memory transfer, since execution speed depends heavily on data transfer to the GPU. From a quick glance at the code, I see that the loads of the matrices into shared memory are coalesced, so it should be a good example.
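
For anyone who has not looked at the sample, a rough sketch of a tiled multiply in the same spirit is below (this is not the SDK code; the TILE size, names and the assumption that all dimensions divide evenly by TILE are mine):

// Sketch of a tiled matrix multiply with coalesced global loads into shared memory.
// C = A * B, A is hA x wA, B is wA x wB, all dimensions divisible by TILE.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16   // the standard 16x16 block mentioned above

__global__ void matMulTiled(const float *A, const float *B, float *C, int wA, int wB)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // row of C computed by this thread
    int col = blockIdx.x * TILE + threadIdx.x;   // column of C computed by this thread
    float acc = 0.0f;

    for (int t = 0; t < wA / TILE; ++t) {
        // Adjacent threads in x read adjacent addresses, so these global loads are coalesced.
        As[threadIdx.y][threadIdx.x] = A[row * wA + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * wB + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * wB + col] = acc;
}

int main()
{
    const int hA = 1024, wA = 1024, wB = 1024;   // illustrative sizes
    float *A, *B, *C;
    cudaMalloc(&A, hA * wA * sizeof(float));
    cudaMalloc(&B, wA * wB * sizeof(float));
    cudaMalloc(&C, hA * wB * sizeof(float));
    cudaMemset(A, 0, hA * wA * sizeof(float));
    cudaMemset(B, 0, wA * wB * sizeof(float));

    dim3 block(TILE, TILE);
    dim3 grid(wB / TILE, hA / TILE);
    matMulTiled<<<grid, block>>>(A, B, C, wA, wB);
    cudaDeviceSynchronize();                     // cudaThreadSynchronize() in the 3.x toolkits
    printf("done\n");

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}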

It is MatrixMul.exe in the SDK; kindly run it and post the results. This should be useful.

Edit: also check this:

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win64\Release>matrixmul.exe --sizemult=10

[ matrixMul ]
matrixmul.exe Starting…

Device 0: “GeForce GTX 275” with Compute 1.3 capability
Using Matrix Sizes: A(800 x 1600), B(800 x 800), C(800 x 1600)

Run Kernels…

matrixMul, Throughput = 211.0272 GFlop/s, Time = 0.00970 s, Size = 2048000000 Ops, NumDevsUsed = 1, Workgroup = 256

This is near the peak performance that the GTX-275 can attain in matrix multiplication, of course with the standard 16x16 block. With a few modifications you can make it better, but this is the standard. I want to see your results, since I plan to buy a GTX-480 card and I will not be certain unless I see its performance on this example.

Alex.

Newbie equipment question: could an x8 link in an x16 PCIe 1.0 slot (Tyan mobo, bah humbug; AMD Opteron 2356, F12, GTX 470) lower the H->D and D->H bandwidth by more than 2/3? D<->D also comes in about 25% lower. I expected the drop-off to be more linear.

The claim is that performance differs based on the system, not on the GPU.

-----Device to Device Bandwidth, 1 Device(s)--------

This actually tells me nothing, as you can see. Of course motherboards differ in performance, but who really cares about device-to-host transfers, since there is no such thing while a kernel is running? If you expect host-to-device transfer to be your bottleneck, then you have failed as a CUDA programmer. It is as strict as that.

Edit: In general I avoid using the term "fail", since I know of cases where people have made complete fools of themselves with that phrase. But in this case things are very strict: when you deal with a GPU and its large amount of memory, the code should not be shuttling tons of MB back and forth to the motherboard's DDR memory. It is very bad programming. :-)

Best,

Alexander.

Whatever. I didn’t write their program. To the extent that it demonstrates a problem on one of their systems, it is a “good” program.

You are reducing the number of lanes by 50% (from 16 to 8) and the throughput per lane by 50% (from 500 MB/s to 250 MB/s), so your H->D and D->H numbers look about right (16 x 500 MB/s ≈ 8 GB/s versus 8 x 250 MB/s ≈ 2 GB/s, a 75% drop, which is indeed more than 2/3).

No, I am not referring to their program; the CUDA SDK is fantastic in programming style and a reference. Actually, looking at it, it is just the cudaMemcpyAsync call.

I am talking about general practice. This should be regarded as a test of whether you get close to 4 GB/s device-to-host and host-to-device.

The asynchronous case is indeed interesting in that it differs between the two vendors. Anyway, I use Intel, so am I on the safe side? :-)

Best,

Alex.

It is worth pointing out that that isn't correct. The AMD board is an MCP55 Pro chipset board, which is only PCI-e 1.0 compliant, and that is why its bandwidth test numbers are lower.

As to why the AMD board is faster, the only thing I can think of is NUMA processor affinity and command latency. It could be that the Opteron arrangement can push commands and data to the device with lower latency than the Intel setup (even though the total bandwidth is lower).
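
One crude way to test the NUMA idea (a sketch, assuming Linux; the core number is passed on the command line and the benchmark body is left out) is to pin the host thread to a given core before the CUDA context is created, and then compare runs pinned to cores on different nodes:

// Sketch: pin the host thread to one core before CUDA context creation,
// then run the usual benchmark and compare cores on different NUMA nodes.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int core = (argc > 1) ? atoi(argv[1]) : 0;   // core to pin to

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    cudaFree(0);                                 // force context creation on the pinned thread
    printf("CUDA context created while pinned to core %d\n", core);

    // ... run the timed kernels / transfers here and compare per core ...
    return 0;
}

Running it once pinned to a core on each node (for example core 0 and core 4, depending on your topology) should show whether placement matters.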

Yes, my results are labelled correctly. Although this seems contradictory with the timing results, it is not.

The reason is that the tests we run should not be affected by host-device transfer speed. One of the kernels that we time performs no host-device transfer. Another application performs host-device transfer in a parallel stream, and in this latter case the transfers take less time than the computation kernel (a rough sketch of this overlap follows below).

The computation time differences must have some cause other than host-device transfer rates.
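
For context, the overlap in the second application looks roughly like this (a sketch with placeholder names, sizes and kernel, not our actual code):

// Sketch: overlap a host->device copy with a kernel running in another stream.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void computeKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);

    float *h_next, *d_current, *d_next;
    cudaMallocHost(&h_next, bytes);              // pinned host memory, required for async copies
    cudaMalloc(&d_current, bytes);
    cudaMalloc(&d_next, bytes);
    cudaMemset(d_current, 0, bytes);

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // The copy of the next input overlaps with computation on the current one;
    // the copy finishes well before the kernel does, as described above.
    cudaMemcpyAsync(d_next, h_next, bytes, cudaMemcpyHostToDevice, copyStream);
    computeKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(d_current, n);

    cudaStreamSynchronize(copyStream);
    cudaStreamSynchronize(computeStream);
    printf("copy and compute both finished\n");

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFreeHost(h_next);
    cudaFree(d_current);
    cudaFree(d_next);
    return 0;
}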

Thanks for pointing out that we have PCI-e 1.0 on the AMD system.

Could you please point us to any source of more information on the latter suggestion? Are there published specs?

I was just checking, since you seemed to be under the impression that the bandwidth was the same (PCI-E x16 2.0) in both systems. There had been a previous discussion here of anomalous bandwidth results on Tylersburg motherboards (tmurray hypothesized an immature BIOS), although this appears not to have been your problem. Avidday's suggestion also makes sense, and is corroborated by tmurray's comments in the above thread.

I don’t know how the GPU derives the actual shader frequency, but maybe the two systems just run at slightly different PCIe frequencies?