I used a GTX 580 to run the CUDA SDK sample “bandwidthTest” in both pageable and pinned memory modes.
The transfer speeds seem unreasonably slow.
Can anyone help me figure out what is going on?
The CPU I use is an Intel Core 2 Quad Q6600.
Thanks in advance.
Here are the results:
Device 0: GeForce GTX 580
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1465.0
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1129.5
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 143179.3
[bandwidthTest] - Test results:
PASSED
Host to Device Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2509.3
Device to Host Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1777.1
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 143179.3
In my opinion, both Moon W’s and alrikai’s numbers are low. When I run the bandwidth test on our main GPU computer (with two GTX 470s and an 8800GT), I get the following results:
Running on...
Device 0: GeForce GTX 470
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5250.2
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4341.5
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 94142.0
[bandwidthTest] - Test results:
PASSED
Even the 8800GT in this machine gets similar Host-Device bandwidth numbers (5095 MB/s HtD and 4115 MB/s DtH).
On another computer we have a GTX 460 and the same Intel Q6600 that Moon W has in his PC, on a consumer motherboard. Here the bandwidth test results are:
Running on......
device 0:GeForce GTX 460
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1351.6
Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1161.0
Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 56136.7
&&&& Test PASSED
Much lower than the other PC.
So I bet you should blame your motherboard (PCIe 1.x vs. 2.0) or CPU (memory bandwidth); one of those is probably the bottleneck for Host↔Device communication in your PC at the moment.
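If you want to double-check the raw link speed outside the SDK sample, a minimal sketch like the one below times plain cudaMemcpy calls from a pinned host buffer, which is essentially what the pinned half of bandwidthTest measures; swap cudaMallocHost for malloc (and cudaFreeHost for free) and you get the pageable case. The transfer size and repeat count here are arbitrary choices of mine. Roughly speaking, a healthy PCIe 2.0 x16 link gives 5-6 GB/s pinned, and a 1.x x16 link about half that.

// bw_check.cu - time H2D and D2H copies from a pinned host buffer
// (illustrative sketch; compile with: nvcc bw_check.cu -o bw_check)
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 << 20;     // 32 MB, the same size bandwidthTest uses
    const int reps = 20;               // average over a few copies

    void *h_buf = 0, *d_buf = 0;
    cudaMallocHost(&h_buf, bytes);     // pinned; use malloc() here for the pageable case
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Host -> Device
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.1f MB/s\n", (bytes / (1024.0 * 1024.0)) * reps / (ms / 1000.0));

    // Device -> Host
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H: %.1f MB/s\n", (bytes / (1024.0 * 1024.0)) * reps / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}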
I’m interested in more information regarding this.
I have a GTX 460 SE with CUDA 4.0 in a SuperMicro PC (server-class Nehalem Xeons, dual socket, 32 GB DDR3 @ 1066) with a Supermicro X8DAH+ motherboard, documented as having PCIe 2.0 x16. I have verified that the card is in a PCIe 2.0 x16 slot and checked everything else obvious. These are my results (note the roughly half speed on device to host, and the slow speed on host to device):
[bandwidthTest]
./bandwidthTest Starting...
Running on...
Device 0: GeForce GTX 460
Quick Mode
Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3636.5
Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1743.0
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 59663.7
[bandwidthTest] - Test results:
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
[bandwidthTest]
./bandwidthTest Starting...
Running on...
Device 0: GeForce GTX 460
Quick Mode
Host to Device Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4599.1
Device to Host Bandwidth, 1 Device(s), Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1822.3
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 59693.7
[bandwidthTest] - Test results:
PASSED
Press <Enter> to Quit...
I compared the datasheet of the Intel 5520 chipset on your SuperMicro motherboard with the datasheet of the X58 chipset on my motherboard, and they look very similar. The main difference I found (after a quick look) is that the 5520 supports dual-CPU systems, while the X58 is single-CPU only.
One thing I can think of (though I’m not sure it has any impact on performance) is that on your motherboard the data has to travel through more chips and interfaces to get from the CPU to the GPU than on mine. Maybe you can try running your program on the CPU that is closest to the GPU and see whether that changes performance (just guessing here; there is a rough code sketch at the end of this post). (Maybe you can swap the GPU from one true x16 slot to the other, or even remove one CPU, and see what that does for performance.) Some illustrations below:
It would be nice if there were a list (possibly from NVIDIA) giving these kinds of CPU ↔ GPU bandwidth numbers, at least for professional workstation/server hardware (e.g. dual Xeon ↔ Tesla bandwidth).
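If you want to test the near-node vs. far-node idea programmatically rather than with numactl, something along these lines should work. This is only a rough sketch under my own assumptions (Linux with libnuma installed, CUDA 4.0 or later for cudaHostRegister; the file name, node number, and buffer size are placeholders, and error checking is omitted):

// numa_bw.cu - measure H2D bandwidth from a host buffer placed on a chosen NUMA node
// Build (assumption): nvcc numa_bw.cu -lnuma -o numa_bw
#include <cstdio>
#include <cstdlib>
#include <numa.h>              // libnuma: numa_available, numa_run_on_node, numa_alloc_onnode
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    if (numa_available() < 0) { printf("no NUMA support on this system\n"); return 1; }

    int node = (argc > 1) ? atoi(argv[1]) : 0;   // NUMA node to test (0 or 1 on a dual-socket box)
    const size_t bytes = 32 << 20;               // 32 MB, like bandwidthTest
    const int reps = 20;

    numa_run_on_node(node);                      // keep the calling thread on that node

    // Allocate the host buffer on the chosen node, then pin it so cudaMemcpy can DMA directly.
    void *h_buf = numa_alloc_onnode(bytes, node);
    cudaHostRegister(h_buf, bytes, cudaHostRegisterPortable);

    void *d_buf = 0;
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("node %d -> GPU: %.1f MB/s\n",
           node, (bytes / (1024.0 * 1024.0)) * reps / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaHostUnregister(h_buf);
    numa_free(h_buf, bytes);
    return 0;
}

Running it once per node (./numa_bw 0 versus ./numa_bw 1) should show whether the socket far from the GPU is noticeably slower.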
The GPU is in slot 6. The x8-in-an-x16 slot is marked on the board, as are the rest. I had checked the manual to make sure I chose a true 2.0 x16 slot.
I actually have 24 GB (not 32 GB, sorry) of RAM: six 4 GB sticks of PC3-10600R, in P1 DIMM1A, 1B, and 1C and P2 DIMM1A, 1B, and 1C, which I believe is the population the documentation says is preferable.
The processors are Xeon E5520s @ 2.27 GHz.
I tested this on three other systems with the same setup and got similar results. In a normal Dell workstation, the card gets about 5.5 GB/s up and down, as I would have hoped.
I would think that with QPI between the processors I could sustain ~5 GB/s without any trouble… but to your point:
(BTW, numactl lets you control where your process runs and where it allocates its memory.) EDIT: I was originally using --physcpubind instead of --cpunodebind; the results are fixed now.
I just ran the tests with numactl, binding bandwidthTest to one CPU/memory node at a time:
Unpinned performance is horrid off of either CPU. For slot 6 it got a little better with everything running off CPU 1 and pinned memory, but device-to-host is still slow… and a standard Dell workstation (using a similar-generation i7) gets better speeds without using pinned memory.
Slot 2 was bad too, though worse in some places and better in others… I can’t trust this machine for benchmarking with such wild and poor results…
Could it be some other issue with the chipset? Furthermore, I have two of these GPUs (both also new) and they show the same behavior.
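One more thing that might be worth ruling out is whether the card actually trained at 2.0 x16 rather than dropping to x8 or Gen1. If your driver ships an NVML new enough to expose the PCIe link queries (I am not sure every GeForce/driver combination does), a small check along these lines should report it (illustrative sketch, my own file name; link against -lnvidia-ml):

/* pcie_check.cu - query the PCIe link the GPU actually trained at, via NVML.
   Build (assumption): nvcc pcie_check.cu -lnvidia-ml -o pcie_check
   Note: these queries may be unsupported on some GeForce boards / older drivers. */
#include <cstdio>
#include <nvml.h>

int main(void)
{
    nvmlReturn_t r = nvmlInit();
    if (r != NVML_SUCCESS) { printf("nvmlInit failed: %s\n", nvmlErrorString(r)); return 1; }

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);         /* GPU 0; adjust the index if needed */

    unsigned int curGen = 0, curWidth = 0, maxGen = 0, maxWidth = 0;
    nvmlDeviceGetCurrPcieLinkGeneration(dev, &curGen);
    nvmlDeviceGetCurrPcieLinkWidth(dev, &curWidth);
    nvmlDeviceGetMaxPcieLinkGeneration(dev, &maxGen);
    nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);

    printf("current link: Gen%u x%u (max supported: Gen%u x%u)\n",
           curGen, curWidth, maxGen, maxWidth);

    nvmlShutdown();
    return 0;
}

Newer versions of nvidia-smi -q should show the same link information, if your driver supports it.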
To the OP - sorry if this is considered thread hijacking, but hopefully it gives you things to try as well.
Thanks, Gert-Jan, for testing on two kinds of computers!
My CPU is an Intel Q6600 and my motherboard is an ASUS P5E-VM DO,
which has PCIe 1.1.
A very old computer…
I guess this old CPU plus the old PCIe link is the reason for my low transfer speeds.