Bandwidth Usage

I have noticed that the bandwidth reported by all the SDK programs I run on my 8800 GT card ranges between 1.5 and 2.5 GB/s, averaging around 1.8 GB/s. I am lost here and quite clueless about what the reason might be.

It’s possible that the PCI-E slot doesn’t provide that much bandwidth…

People here talk about x2, x4, etc. when it comes to the PCI-E bus. I don’t know what that refers to.

It could also be your BIOS settings.

Check along those lines…

Maybe the best brute-force way would be to just change the PCI-E slot and re-try your program… If that does not work, try checking the x2/x4 lane width and the BIOS settings.

I think shwetha is talking about device to device bandwidth. Have you tried different drivers? What drivers are you running now?

It’s also possible the card is borked and should be replaced.

Along similar lines, what are reasonable bandwidths for a GTX 280 OC?

This is my bandwidth test output, and it seems relatively low compared to what I’ve read elsewhere on these forums (~1.3 GB/s for non-page-locked and up to 3 GB/s for page-locked memory).

My current application uses page-locked memory and only manages ~0.625 GB/s - allocated with cudaMallocHost() and transferred with cudaMemcpyAsync().
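
(For context, a cudaMallocHost() / cudaMemcpyAsync() transfer of the kind described above looks roughly like the following sketch; the buffer size, the single stream, and the lack of error checking are illustrative simplifications, not the poster’s actual code.)

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 * 1024 * 1024;     // 32 MB, same order as bandwidthTest uses
    float *h_pinned, *d_buf;
    cudaStream_t stream;

    cudaMallocHost((void**)&h_pinned, bytes);  // page-locked (pinned) host allocation
    cudaMalloc((void**)&d_buf, bytes);
    cudaStreamCreate(&stream);

    // Asynchronous host-to-device copy: returns immediately, completes on the stream
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);             // wait for the copy before using the data

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}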

Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               612.9

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               789.5

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               114983.8

&&&& Test PASSED

and

Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               637.2

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               814.1

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               114163.4

&&&& Test PASSED

Thanks.

That is not device-to-device bandwidth. Even a host-to-device memory copy with cudaMemcpyHostToDevice (1 million elements) gives only ~2.0 GB/s.
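
(A transfer like that can be timed directly with CUDA events; a minimal sketch, using pageable host memory and ~1 million floats as stated above, with error checking omitted:)

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    const size_t n = 1 << 20;                  // ~1 million elements
    const size_t bytes = n * sizeof(float);

    float *h_buf = (float*)malloc(bytes);      // pageable host memory
    float *d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host to device: %.2f GB/s\n", bytes / (ms * 1.0e6));  // bytes / (ms * 1e6) = GB/s

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}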

PCI-E is still the gateway between device and host as well. Frankly, I don’t know much about the x2 thing… but I am about 50% sure that it can affect host-device memory transfers as well.

Mr. Anderson – would you like to comment here?

When I run bandwidthTest, I get 1993.8 MB/s for host-to-device transfers, 2030.6 MB/s for device-to-host, and 3989.0 MB/s for device-to-device.

Aren’t these values quite low? How does kernel 6 of the parallel reduction example in the CUDA SDK then achieve a bandwidth of 40 GB/s? Is there any connection between the two, or am I missing something here?

Your host ↔ device values are fine. Your device to device value is low by a factor of 10 or more. This is usually a case of the card being down-clocked by power management or a crashed driver.

Contrary to what some others have said in this thread, the device-to-device numbers printed by bandwidthTest benchmark a copy from one region of device memory to another on the same card, NOT a copy from one physical device to another. When the data stays on the GPU, any kernel or memcpy operation can move it around at very near the peak device memory bandwidth.
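
(In other words, the “Device to Device” figure corresponds to a copy like the sketch below, which never crosses the PCI-E bus; the buffer size is illustrative.)

#include <cuda_runtime.h>

int main()
{
    // Both buffers live in GPU memory, so this copy never touches the PCI-E bus
    const size_t bytes = 32 * 1024 * 1024;
    float *d_src, *d_dst;
    cudaMalloc((void**)&d_src, bytes);
    cudaMalloc((void**)&d_dst, bytes);

    // bandwidthTest times a copy of this kind; since every byte is both read from and
    // written to device memory, the reported figure reflects read + write traffic
    cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}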

It gets 40 GiB/s because the reduction example is limited by the memory performance of the device, and the peak memory bandwidth of an 8800 GT is 57 GiB/s (sustained bandwidths around 20% below peak are typical).
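
(As a back-of-the-envelope check, the “bandwidth” a kernel like the reduction reports is just bytes touched divided by kernel time; an illustrative helper, not the SDK’s exact code:)

#include <cstddef>

// Effective bandwidth of a kernel that reads numElements floats once (illustrative)
double effective_bandwidth_gb_s(size_t numElements, float elapsedMs)
{
    double bytes = (double)numElements * sizeof(float);   // 4 bytes read per element
    return bytes / (elapsedMs * 1.0e6);                   // bytes / (ms * 1e6) = GB/s
}

// Example: 4M elements (16 MB) reduced in ~0.4 ms  ->  ~40 GB/s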

Thank you Mr.Anderson.

Now I see the point… She was referring to host-device memory bandwidth, which is generally much lower than the intra-device memory bandwidth.

I remember a thread in which someone from NVIDIA said that the “device mem - device mem” bandwidth had actually gone down in some driver release. He said they were working on it.

I don’t know what the current status on that is or what the latest driver offers.

I checked my PCI-E slots as well. I have a Quadro FX 370 in the x8 slot and the 8800 GT in the x16 slot. Will swapping the cards, so that the 8800 GT sits in the x8 slot, improve my device-device bandwidth?

Common sense says x16 is wider than x8, i.e. faster… so swapping might actually cut down the bandwidth. It would be a fun exercise to try it out, though…

By the way, the intra-device bandwidth being very low could actually be a problem with the driver. Which driver version are you using? Someone from NVIDIA might be able to help out…

Make sure you have the latest BIOS; it often makes a pretty significant difference.

I hope you are talking about host-device and device-host bandwidth…

Can you say something about the device-device memory bandwidth that has slowed down in her experiments?

I remember you talking about some driver refinement that had actually caused a reduction in this bandwidth.

That was a ~10% reduction, if I recall correctly, not the 10x reduction the OP is seeing. I’ve only seen a change that drastic after the driver has had a major crash (e.g. X with compositing freezing the GPU up).

Hi all,

I have a related question:

in the (great) “Advanced CUDA” talk from NVision08, an estimate is given of how fast CPU-to-GPU memory transfers can be expected to be, essentially: transfer time ≈ 2 * (data size) / 5.2 GB/s.

So far, so good. My first questions:

(1) Where does this factor of “2 *” come from? I suppose it accounts for both the CPU-to-GPU and the GPU-to-CPU transfer, right?

(2) Where does the PCI-e speed of 5.2 GB/s come from?

 - My best guess would be that it comes from the PCI-e spec; however, for PCI-e 2.0 at x16, the speed should be more like 500 MB/s * 16 = 8 GB/s?

 - My second guess is that it takes the actual numbers from the “bandwidth SDK example” (e.g., the output for host-to-device memory bandwidth).

I have applied the above calculation to my transfer size of 200 MB:

2 * 200 MB / 5.2 GB/s = 75 ms (both ways?)

(3) However, during actual execution my profiler tells me that a single host-to-device copy already takes 80 ms (which would mean 160 ms both ways?). Note: I use a GTX 280, and it sits in a PCI-e 2.0 x16 slot (which provides a theoretical maximum of 8 GB/s, as mentioned above).

(4) Is it realistic that there is such a big difference between the host-to-device and device-to-host copy times? For the same array being transferred, I get 80 ms host-to-device vs. 118 ms device-to-host (almost 1.5x as slow).

OK, I have done some further calculations. First, see the output of the bandwidth example for my GTX 280:

Running on......
      device 0: GeForce GTX 280

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               2112.8

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1582.7

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               118694.4

With that in mind, I have updated my calculations:

host to device: 200 MB / 2112.8 MB/s = 95 ms estimated vs. 80 ms measured

device to host: 200 MB / 1582.7 MB/s = 126 ms estimated vs. 118 ms measured

Can somebody confirm whether that is the correct way to compute/estimate CPU-GPU memory transfer times?

(And if so, can somebody explain why these measured PCI-e rates are roughly 4 times lower than the maximum theoretical transfer rate of 8 GB/s for PCI-e 2.0 x16?)

Thanks,

Michael

Does anybody know the answer to my questions (especially (1) and (2) from my previous post)?

Any hints greatly appreciated!
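
One way to cross-check estimates like the ones above is to time each direction directly with CUDA events rather than relying on the profiler. A minimal sketch (the 200 MB pinned buffer mirrors the example above; error checking is omitted):

#include <cuda_runtime.h>
#include <cstdio>

// Time one direction of a transfer with CUDA events (illustrative helper)
static float timed_copy_ms(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t bytes = 200 * 1024 * 1024;    // ~200 MB, as in the example above
    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);     // pinned; use malloc() for the pageable case
    cudaMalloc((void**)&d_buf, bytes);

    float h2d = timed_copy_ms(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    float d2h = timed_copy_ms(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    printf("H->D %.1f ms (%.0f MB/s), D->H %.1f ms (%.0f MB/s)\n",
           h2d, bytes / (h2d * 1.0e3), d2h, bytes / (d2h * 1.0e3));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}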