Other than motherboard/CPU, the two systems are identical, as follows:
Running Fedora 10 OS, same kernel version.
Same video driver is installed (195.36.15)
The systems are tested with the same physical GTX 480 card (a second card was also tested, with the same results)
Both motherboards use PCI-E x16 2.0 interface.
All tested applications are compiled with CUDA Toolkit 3.0 on one computer, so testing is carried out with the same executables on both systems.
Tested applications (C, CUDA):
Complex applications with input/output data transfer, multiple kernels
One long kernel is timed, with an execution time of a few seconds. No host-device data transfer.
In all applications, the AMD-based system performs faster by 8-9%.
Changing the driver to 256.25 improves performance on both systems, leaving the difference between the systems unchanged.
We expect CUDA performance to be approximately equal on different systems, so the observed differences are very strange. Note that in one of our applications there is no device-host data transfer, so we assume CPU memory should not have any effect.
If anyone has similar observations, or any idea about why this happens, please respond…
Are you sure you have those labeled correctly? On the results labeled “AMD”, you are either only getting 8 PCIe lanes or are limited to PCIe x16 1.0 throughput. If this were the slower system, then I would say that the decreased PCIe bandwidth was the cause.
What I would like to see is the matrix multiplication example from the SDK on the 480. Is it possible for you to run this test?
I get 81 GFlop/s with my GTX-275. How much do you get?
Matrix multiplication is a nice example for testing memory transfer, since the execution speed is heavily dependent on data transfer to the GTX processor. From a quick glance at the code, I see that the transfers of the matrices into shared memory are coalesced, so it should be a nice example.
It is MatrixMul.exe in the SDK; kindly run it and post the results. This should be useful.
This is near the peak performance that the GTX-275 can attain in matrix multiplication, of course with the standard 16x16 block. With a few modifications you can make it better, but this is the standard. I want to see your results, since I plan to buy a GTX-480 card and I am not certain unless I see the performance on this example.
Newbie equipment question-- Could an x8 link in an x16 PCIe 1.0 slot (tyan mobo--bah humbug, AMD Opteron 2356, F12, GTX 470) lower the H->D and D->H bandwidth by more than 2/3? D<->D also comes in about 25% lower. I expected the drop-off to be more linear.
-----Device to Device Bandwidth, 1 Device(s)--------
This actually tells me nothing, as you can see. Of course motherboards differ in performance, but who really cares about device-to-host bandwidth, since once the kernel is invoked there is no such transfer? If you see host-to-device transfers as a bottleneck, then you have failed as a programmer in CUDA. It is as strict as this.
Edit: In general I avoid using the term "fail", since I know of cases where people have been completely ridiculed for claiming this, so I avoid it. But in this case things are very strict: when you deal with a GPU and its large amount of memory, the code should not communicate in tons of MBs with the DDR memory of the motherboard. It is very bad programming. :-)
It is worth pointing out that this isn't correct. The AMD board uses the MCP55 Pro chipset, which is only PCI-e 1.0 compliant, and that is why the bandwidth test numbers are lower.
As to why the AMD board is faster, the only thing I can think of is NUMA processor affinity and command latency. It could be that the Opteron arrangement can push commands and data to the device with lower latency than the Intel setup (even though the total bandwidth is lower).
Yes, my results are labelled correctly. Although this may seem to contradict the timing results, it does not.
The reason is that the tests we run should not be affected by host-device transfer speed. One of the kernels that we time performs no host-device transfer. Another application performs host-device transfers in a parallel stream; in this latter case the transfers take less time than the computation kernel.
The computation time differences have some cause other than host-device transfer rates.
I was just checking, since you seemed to be under the impression that the bandwidth was the same (PCI-E x16 2.0) on both systems. There had been a previous discussion here of anomalous bandwidth results on Tylersburg motherboards (tmurray hypothesized an immature BIOS), although this appears not to have been your problem. Avidday’s suggestion also makes sense, and is corroborated by tmurray’s comments in the above thread.