Bandwidth Usage

I have noticed that the bandwidth reported by all the SDK programs I run on my 8800 GT card ranges between 1.5 and 2.5 GB/s, averaging around 1.8 GB/s. I am lost here and quite clueless about what the reason might be.

It’s possible that the PCI-E slot doesn’t provide that much bandwidth…

People here talk about x2, x4, etc. when it comes to the PCI-E bus. I don’t know what that refers to.

It could also be your BIOS settings.

Check along those lines…

Maybe the best brute-force way would be to just change the PCI-E slot and re-try your program… If that does not work, try checking the x2/x4 lane width and the BIOS settings.

I think shwetha is talking about device to device bandwidth. Have you tried different drivers? What drivers are you running now?

It’s also possible the card is borked and should be replaced.

Along similar lines, what are reasonable bandwidths for a GTX 280 OC?

This is my bandwidth test output, and it seems relatively low compared to what I’ve read elsewhere on these forums (~1.3 GB/s for non-page-locked and up to 3 GB/s for page-locked memory).

My current application uses page-locked memory and only manages ~0.625 GB/s - allocated with cudaMallocHost() and transferred with cudaMemcpyAsync().
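
(For context, a cudaMallocHost() / cudaMemcpyAsync() transfer of the kind described above looks roughly like the following sketch; the buffer size, the single stream, and the lack of error checking are illustrative simplifications, not the poster’s actual code.)

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 * 1024 * 1024;     // 32 MB, same order as bandwidthTest uses
    float *h_pinned, *d_buf;
    cudaStream_t stream;

    cudaMallocHost((void**)&h_pinned, bytes);  // page-locked (pinned) host allocation
    cudaMalloc((void**)&d_buf, bytes);
    cudaStreamCreate(&stream);

    // Asynchronous host-to-device copy: returns immediately, completes on the stream
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);             // wait for the copy before using the data

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}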

Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               612.9

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               789.5

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               114983.8

&&&& Test PASSED

and

Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               637.2

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               814.1

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               114163.4

&&&& Test PASSED

Thanks.

That is not device-to-device bandwidth. Even a host-to-device memory copy with cudaMemcpyHostToDevice (1 million elements) gives only ~2.0 GB/s.
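
(A transfer like that can be timed directly with CUDA events; a minimal sketch, using pageable host memory and ~1 million floats as stated above, with error checking omitted:)

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    const size_t n = 1 << 20;                  // ~1 million elements
    const size_t bytes = n * sizeof(float);

    float *h_buf = (float*)malloc(bytes);      // pageable host memory
    float *d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host to device: %.2f GB/s\n", bytes / (ms * 1.0e6));  // bytes / (ms * 1e6) = GB/s

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}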

PCI-E is still the gateway between device and host as well. Frankly, I don’t know much about the x2 thing… but I am about 50% sure that it can affect host-device memory transfers as well.

Mr. Anderson – would you like to comment here?

When I run bandwidthTest, I get 1993.8 MB/s for host-to-device transfers, 2030.6 MB/s for device-to-host, and 3989.0 MB/s for device-to-device.

Aren’t these values quite low? How does kernel 6 of the parallel reduction example in the CUDA SDK then achieve a bandwidth of 40 GB/s? Is there any connection between the two, or am I missing something here?

Your host ↔ device values are fine. Your device to device value is low by a factor of 10 or more. This is usually a case of the card being down-clocked by power management or a crashed driver.

Contrary to what some others have said in this thread, the device-to-device numbers printed by bandwidthTest benchmark a copy from one region of device memory to another on the same card, NOT a copy from one physical device to another. When the data stays on the GPU, any kernel or memcpy operation can move it around at very near the peak device memory bandwidth.
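
(In other words, the “Device to Device” figure corresponds to a copy like the sketch below, which never crosses the PCI-E bus; the buffer size is illustrative.)

#include <cuda_runtime.h>

int main()
{
    // Both buffers live in GPU memory, so this copy never touches the PCI-E bus
    const size_t bytes = 32 * 1024 * 1024;
    float *d_src, *d_dst;
    cudaMalloc((void**)&d_src, bytes);
    cudaMalloc((void**)&d_dst, bytes);

    // bandwidthTest times a copy of this kind; since every byte is both read from and
    // written to device memory, the reported figure reflects read + write traffic
    cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}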

It gets 40 GiB/s because the reduction example is limited by the memory performance of the device, and the peak memory bandwidth of an 8800 GT is 57 GiB/s (sustained bandwidths around 20% below peak are typical).
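
(As a back-of-the-envelope check, the “bandwidth” a kernel like the reduction reports is just bytes touched divided by kernel time; an illustrative helper, not the SDK’s exact code:)

#include <cstddef>

// Effective bandwidth of a kernel that reads numElements floats once (illustrative)
double effective_bandwidth_gb_s(size_t numElements, float elapsedMs)
{
    double bytes = (double)numElements * sizeof(float);   // 4 bytes read per element
    return bytes / (elapsedMs * 1.0e6);                   // bytes / (ms * 1e6) = GB/s
}

// Example: 4M elements (16 MB) reduced in ~0.4 ms  ->  ~40 GB/s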

Thank you Mr.Anderson.

Now I see the point… She was referring to host-device memory bandwidth, which is generally much lower than the intra-device memory bandwidth.

I remember a thread in which someone from NVIDIA said that the “device mem - device mem” bandwidth had actually gone down in some driver release. He said they were working on it.

I don’t know what the current status on that is or what the latest driver offers.

I checked my PCI-E slots as well. I have a Quadro FX 370 in the x8 slot and the 8800 GT in the x16 slot. Will swapping the cards, so that the 8800 GT sits in the x8 slot, improve my device-device bandwidth?

Common sense says x16 is wider than x8, i.e. faster… so swapping might actually cut down the bandwidth. It would be a fun exercise to try it out, though…

By the way, the intra-device bandwidth being very low could actually be a problem with the driver. Which driver version are you using? Someone from NVIDIA might be able to help out…

Make sure you have the latest BIOS; it often makes a pretty significant difference.

I hope you are talking about host-device and device-host bandwidth…

Can you say something about the device-device memory bandwidth that has slowed down in her experiments?

I remember you talking about some driver refinement that had actually caused a reduction in this bandwidth.

That was a ~10% reduction, if I recall correctly, not the 10x reduction the OP is seeing. I’ve only seen a change that drastic after the driver has had a major crash (e.g. X with compositing freezing the GPU up).

Hi all,

I have a related question:

in the (great) “Advanced CUDA” talk from NVision08, an estimate is given of how fast CPU-to-GPU memory transfers can be expected to be, essentially: transfer time ≈ 2 * (data size) / 5.2 GB/s.

So far, so good. My first questions:

(1) Where does this factor of “2 *” come from? I suppose it accounts for both the CPU-to-GPU and the GPU-to-CPU transfer, right?

(2) Where does the PCI-e speed of 5.2 GB/s come from?

 - My best guess would be that it comes from the PCI-e spec; however, for PCI-e 2.0 at x16, the speed should be more like 500 MB/s * 16 = 8 GB/s?

 - My second guess is that it takes the actual numbers from the “bandwidth SDK example” (e.g., the output for host-to-device memory bandwidth).

I have applied the above calculation to my transfer size of 200 MB:

2 * 200 MB / 5.2 GB/s = 75 ms (both ways?)

(3) However, during actual execution my profiler tells me that a single host-to-device copy already takes 80 ms (which would mean 160 ms both ways?). Note: I use a GTX 280, and it sits in a PCI-e 2.0 x16 slot (which provides a theoretical maximum of 8 GB/s, as mentioned above).

(4) Is it realistic that there is such a big difference between the host-to-device and device-to-host copy times? For the same array being transferred, I get 80 ms host-to-device vs. 118 ms device-to-host (almost 1.5x as slow).

OK, I have done some further calculations. First, see the output of the bandwidth example for my GTX 280:

Running on......
      device 0: GeForce GTX 280

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               2112.8

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1582.7

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               118694.4

With that in mind, I have updated my calculations:

host to device: 200 MB / 2112.8 MB/s = 95 ms estimated vs. 80 ms measured

device to host: 200 MB / 1582.7 MB/s = 126 ms estimated vs. 118 ms measured

Can somebody confirm whether that is the correct way to compute/estimate CPU-GPU memory transfer times?

(And if so, can somebody explain why these measured PCI-e rates are roughly 4 times lower than the maximum theoretical transfer rate of 8 GB/s for PCI-e 2.0 x16?)

Thanks,

Michael

Does anybody know the answer to my questions (especially (1) and (2) from my previous post)?

Any hints greatly appreciated!
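
One way to cross-check estimates like the ones above is to time each direction directly with CUDA events rather than relying on the profiler. A minimal sketch (the 200 MB pinned buffer mirrors the example above; error checking is omitted):

#include <cuda_runtime.h>
#include <cstdio>

// Time one direction of a transfer with CUDA events (illustrative helper)
static float timed_copy_ms(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t bytes = 200 * 1024 * 1024;    // ~200 MB, as in the example above
    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);     // pinned; use malloc() for the pageable case
    cudaMalloc((void**)&d_buf, bytes);

    float h2d = timed_copy_ms(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    float d2h = timed_copy_ms(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    printf("H->D %.1f ms (%.0f MB/s), D->H %.1f ms (%.0f MB/s)\n",
           h2d, bytes / (h2d * 1.0e3), d2h, bytes / (d2h * 1.0e3));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}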