I used a GTX 580 to run the CUDA SDK sample “bandwidthTest” in both pageable and pinned memory modes.
The transfer speeds seem unreasonably slow.
Can anyone help me figure out what is going on?
The CPU I use is an Intel Core 2 Quad Q6600.
Thanks in advance.
Here are the results:
Device 0: GeForce GTX 580
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1465.0
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1129.5
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 143179.3
[bandwidthTest] - Test results:
PASSED
Host to Device Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2509.3
Device to Host Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1777.1
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 143179.3
In my opinion, both Moon W’s and alrikai’s numbers are low. When I run the bandwidth test on our main GPU computer (with two GTX 470s and an 8800GT), I get the following results:
Running on...
Device 0: GeForce GTX 470
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5250.2
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4341.5
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 94142.0
[bandwidthTest] - Test results:
PASSED
Even the 8800GT in this machine gets similar Host-Device bandwidth numbers (5095 MB/s HtD and 4115 MB/s DtH).
On another computer we have a GTX 460 and the same Intel Q6600 that Moon W has in his PC, on a consumer motherboard. Here the bandwidth test results are:
Running on......
device 0:GeForce GTX 460
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1351.6
Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1161.0
Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 56136.7
&&&& Test PASSED
Much lower than the other PC.
So I bet you should blame your motherboard (PCIe 1.x vs. 2.0) or CPU (memory bandwidth); one of those is probably the bottleneck for Host↔Device communication in your PC at the moment.
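If you want to double-check the raw link speed outside the SDK sample, a minimal sketch like the one below times plain cudaMemcpy calls from a pinned host buffer, which is essentially what the pinned half of bandwidthTest measures; swap cudaMallocHost for malloc (and cudaFreeHost for free) and you get the pageable case. The transfer size and repeat count here are arbitrary choices of mine. Roughly speaking, a healthy PCIe 2.0 x16 link gives 5-6 GB/s pinned, and a 1.x x16 link about half that.

// bw_check.cu - time H2D and D2H copies from a pinned host buffer
// (illustrative sketch; compile with: nvcc bw_check.cu -o bw_check)
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 << 20;     // 32 MB, the same size bandwidthTest uses
    const int reps = 20;               // average over a few copies

    void *h_buf = 0, *d_buf = 0;
    cudaMallocHost(&h_buf, bytes);     // pinned; use malloc() here for the pageable case
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Host -> Device
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.1f MB/s\n", (bytes / (1024.0 * 1024.0)) * reps / (ms / 1000.0));

    // Device -> Host
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H: %.1f MB/s\n", (bytes / (1024.0 * 1024.0)) * reps / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}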
I’m interested in more information regarding this.
I have a GTX 460 SE with CUDA 4.0 in a SuperMicro PC (server-class Nehalem Xeons, dual socket, 32 GB DDR3 @ 1066) with a Supermicro X8DAH+ motherboard, documented as having PCIe 2.0 x16. I have verified that the card is in a PCIe 2.0 x16 slot and checked everything else obvious. These are my results (note the roughly half speed on device to host, and the slow speed on host to device):
[bandwidthTest]
./bandwidthTest Starting...
Running on...
Device 0: GeForce GTX 460
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3636.5
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1743.0
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 59663.7
[bandwidthTest] - Test results:
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
[bandwidthTest]
./bandwidthTest Starting...
Running on...
Device 0: GeForce GTX 460
Quick Mode
Host to Device Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4599.1
Device to Host Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1822.3
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 59693.7
[bandwidthTest] - Test results:
PASSED
Press <Enter> to Quit...
I compared the datasheet of the Intel 5520 chipset on your SuperMicro motherboard with the datasheet of the X58 chipset on my motherboard, and they look very similar. The main difference I found (after a quick look) is that the 5520 supports dual-CPU systems, while the X58 is single-CPU only.
One thing I can think of (though I’m not sure it has any impact on performance) is that on your motherboard the data has to travel through more chips and interfaces to get from the CPU to the GPU than on mine. Maybe you can try running your program on the CPU that is closest to the GPU and see whether that changes performance (just guessing here; there is a rough code sketch at the end of this post). (Maybe you can swap the GPU from one true x16 slot to the other, or even remove one CPU, and see what that does for performance.) Some illustrations below:
It would be nice if there were a list (possibly from NVIDIA) giving these kinds of CPU ↔ GPU bandwidth numbers, at least for professional workstation/server hardware (e.g. dual Xeon ↔ Tesla bandwidth).
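If you want to test the near-node vs. far-node idea programmatically rather than with numactl, something along these lines should work. This is only a rough sketch under my own assumptions (Linux with libnuma installed, CUDA 4.0 or later for cudaHostRegister; the file name, node number, and buffer size are placeholders, and error checking is omitted):

// numa_bw.cu - measure H2D bandwidth from a host buffer placed on a chosen NUMA node
// Build (assumption): nvcc numa_bw.cu -lnuma -o numa_bw
#include <cstdio>
#include <cstdlib>
#include <numa.h>              // libnuma: numa_available, numa_run_on_node, numa_alloc_onnode
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    if (numa_available() < 0) { printf("no NUMA support on this system\n"); return 1; }

    int node = (argc > 1) ? atoi(argv[1]) : 0;   // NUMA node to test (0 or 1 on a dual-socket box)
    const size_t bytes = 32 << 20;               // 32 MB, like bandwidthTest
    const int reps = 20;

    numa_run_on_node(node);                      // keep the calling thread on that node

    // Allocate the host buffer on the chosen node, then pin it so cudaMemcpy can DMA directly.
    void *h_buf = numa_alloc_onnode(bytes, node);
    cudaHostRegister(h_buf, bytes, cudaHostRegisterPortable);

    void *d_buf = 0;
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("node %d -> GPU: %.1f MB/s\n",
           node, (bytes / (1024.0 * 1024.0)) * reps / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaHostUnregister(h_buf);
    numa_free(h_buf, bytes);
    return 0;
}

Running it once per node (./numa_bw 0 versus ./numa_bw 1) should show whether the socket far from the GPU is noticeably slower.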
The GPU is in slot 6. The x8-in-an-x16 slot is marked on the board, as are the rest. I had checked the manual to make sure I chose a true 2.0 x16 slot.
I actually have 24 GB (not 32 GB, sorry) of RAM: six 4 GB sticks of PC3-10600R, in P1 DIMM1A, 1B, and 1C and P2 DIMM1A, 1B, and 1C, which I believe is the population the documentation says is preferable.
The processors are Xeon E5520s @ 2.27 GHz.
I tested this on three other systems with the same setup and got similar results. In a normal Dell workstation, the card gets about 5.5 GB/s up and down, as I would have hoped.
I would think that with QPI between the processors I could sustain ~5 GB/s without any trouble… but to your point:
(BTW, numactl lets you control where your process runs and where it allocates its memory.) EDIT: I was originally using --physcpubind instead of --cpunodebind; the results are fixed now.
I just ran the tests with numactl, binding bandwidthTest to one CPU/memory node at a time:
Unpinned performance is horrid off of either CPU. For slot 6 it got a little better with everything running off CPU 1 and pinned memory, but device-to-host is still slow… and a standard Dell workstation (using a similar-generation i7) gets better speeds without using pinned memory.
Slot 2 was bad too, though worse in some places and better in others… I can’t trust this machine for benchmarking with such wild and poor results…
Could it be some other issue with the chipset? Furthermore, I have two of these GPUs (both also new) and they show the same behavior.
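One more thing that might be worth ruling out is whether the card actually trained at 2.0 x16 rather than dropping to x8 or Gen1. If your driver ships an NVML new enough to expose the PCIe link queries (I am not sure every GeForce/driver combination does), a small check along these lines should report it (illustrative sketch, my own file name; link against -lnvidia-ml):

/* pcie_check.cu - query the PCIe link the GPU actually trained at, via NVML.
   Build (assumption): nvcc pcie_check.cu -lnvidia-ml -o pcie_check
   Note: these queries may be unsupported on some GeForce boards / older drivers. */
#include <cstdio>
#include <nvml.h>

int main(void)
{
    nvmlReturn_t r = nvmlInit();
    if (r != NVML_SUCCESS) { printf("nvmlInit failed: %s\n", nvmlErrorString(r)); return 1; }

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);         /* GPU 0; adjust the index if needed */

    unsigned int curGen = 0, curWidth = 0, maxGen = 0, maxWidth = 0;
    nvmlDeviceGetCurrPcieLinkGeneration(dev, &curGen);
    nvmlDeviceGetCurrPcieLinkWidth(dev, &curWidth);
    nvmlDeviceGetMaxPcieLinkGeneration(dev, &maxGen);
    nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);

    printf("current link: Gen%u x%u (max supported: Gen%u x%u)\n",
           curGen, curWidth, maxGen, maxWidth);

    nvmlShutdown();
    return 0;
}

Newer versions of nvidia-smi -q should show the same link information, if your driver supports it.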
To the OP - sorry if this is considered thread hijacking, but hopefully it gives you things to try as well.
Thanks, Gert-Jan, for testing on two kinds of computers!
My CPU is an Intel Q6600 and my motherboard is an ASUS P5E-VM DO,
which has PCIe 1.1.
A very old computer…
I guess this old CPU plus the old PCIe link is the reason for my low transfer speeds.