Bandwidth problems with S870 and 177.67

Hi,

I just set up an S870 on CentOS 5.0 with the 177.67 driver and CUDA 2.0. It works fine, but I’m getting poor device-to-device bandwidth results. bandwidthTest from the SDK reports the following:

Running on…
device 0:Tesla C870

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1988.4

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1739.2

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 31036.3

In contrast, the same test on a G80 in my desktop, with the same OS, driver, and CUDA version, gives 65 GB/s. It seems other people are having the same problem: http://forums.nvidia.com/index.php?showtopic=75817&hl=s870.

Overall, my application runs at about 60% of its desktop G80 speed on the S870, though the results are still correct.

I’ve attached the output from nvidia-bug-report as well. Does anyone have any ideas as to what might be going wrong?
nvidia_bug_report.log.gz (22 KB)

It’s not clear from your post: did you perform the S870 test using the same host system as the discrete G80 test?

Sorry, I should have mentioned that. The S870 is attached to an HP ProLiant DL160 G5 with a single 2.0 GHz Xeon. The G80 test runs on my desktop, which is slightly different: a 2.4 GHz Core 2 Duo on an Asus P5N32-E SLI motherboard.

A quick update with something I just noticed. Only devices 0 and 2 give 31 GB/s; devices 1 and 3 report 60 GB/s:

[tb302@compute-0-0 ~]$ /share/apps/NVIDIA_CUDA_SDK/bin/linux/release/bandwidthTest --device=0
Running on…
  device 0:Tesla C870

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1989.4

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1737.7

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 31030.3

&&&& Test PASSED
Press ENTER to exit…

[tb302@compute-0-0 ~]$ /share/apps/NVIDIA_CUDA_SDK/bin/linux/release/bandwidthTest --device=1
Running on…
  device 1:Tesla C870

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2069.6

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1749.4

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 60468.6

&&&& Test PASSED
Press ENTER to exit…

[tb302@compute-0-0 ~]$ /share/apps/NVIDIA_CUDA_SDK/bin/linux/release/bandwidthTest --device=2
Running on…
  device 2:Tesla C870

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1992.2

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1737.5

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 31016.8

&&&& Test PASSED
Press ENTER to exit…

[tb302@compute-0-0 ~]$ /share/apps/NVIDIA_CUDA_SDK/bin/linux/release/bandwidthTest --device=3
Running on…
  device 3:Tesla C870

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2069.9

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1749.7

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 60468.6

&&&& Test PASSED
Press ENTER to exit…
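
For reference, what the device-to-device part of bandwidthTest boils down to is roughly the sketch below. This is my own untested reconstruction against the runtime API, not the actual SDK source; error checking is omitted and the rep count is arbitrary. Run it once per device, the same way bandwidthTest takes --device=N:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int dev = (argc > 1) ? atoi(argv[1]) : 0;   /* device index, like --device=N */
    cudaSetDevice(dev);

    const size_t bytes = 33554432;   /* same transfer size as Quick Mode */
    const int reps = 10;

    void *src = 0, *dst = 0;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    /* Factor of 2: each byte is both read and written on the device,
       which is how the SDK sample reports device-to-device numbers. */
    double mbps = 2.0 * bytes * reps / (ms / 1000.0) / 1e6;
    printf("device %d: %.1f MB/s device-to-device\n", dev, mbps);

    cudaFree(src);
    cudaFree(dst);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

A down-clocked board shows up immediately here: the copy takes twice as long, so the reported number halves.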

I’m seeing something similar on a Quadro Plex Model IV with driver 177.80: GPU 0 reports 30 GB/s while GPU 1 reports 60 GB/s.

Any comments from NVIDIA on this?

Perhaps unrelated, but with 177.80 and bandwidthTest, I got reduced bandwidth on every second run: reduced on the first run, full on the second, reduced on the third, full on the fourth, and so on.

This does not happen with 177.73.

177.73 is the latest CUDA qualified & tested driver.

Check your device clocks with deviceQuery. I bet they are lower than they should be.
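
If you don’t want to dig out the SDK sample, a minimal sketch along these lines (untested, no error checking) shows the same clock information through the runtime API; note that cudaDeviceProp.clockRate is reported in kHz:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        /* clockRate is in kHz; a healthy C870 should show about 1.35 GHz */
        printf("device %d (%s): %.2f GHz\n", dev, prop.name, prop.clockRate / 1e6);
    }
    return 0;
}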

I have the same problem on a D870 with driver 177.67. Well, sort of: on the D870, GPU 0 is down-clocked and GPU 1 isn’t. I’ve got a bug on file with NVIDIA, but nothing has come of it yet.

It seems the driver decides that since the device isn’t attached to a display and isn’t doing anything useful, it should be down-clocked to save power :)

Until the problem is solved, I’ve reverted to the CUDA 2.0 beta, which works fine. I haven’t tried any newer versions, since I have yet to receive a message saying the bug is closed.

When you say that CUDA 2.0 beta works fine, do you mean that you’re using the older driver?

Yes. I’m running CUDA 2.0 beta 2 and the corresponding driver, 177.13.

I’m seeing the same problem with an S870, CUDA 2.0, 177.73, and a 680i motherboard (P6N Diamond). Devices 0 and 2 are clocked at 1.19 GHz, and their device-to-device bandwidth is half that of devices 1 and 3: around 30 GB/s on devices 0 and 2 versus 60+ GB/s on devices 1 and 3.

I also have the problem on a setup with a D870, CUDA 2.0, 177.73, and an Intel X38 motherboard (DX38BT). Device 0 is clocked at 1.19 GHz and has half the bandwidth of device 1.

The problem does not show up on either system when using CUDA 1.1. I haven’t tried 2.0 beta 2.

Well, CUDA 2.1 is just around the corner, so maybe they ignored this problem entirely for 2.0 and will fix it in 2.1; fingers crossed. It seems a shame that a majority of the original Tesla line cannot be used with CUDA 2.0 in a production setup, despite the problem having been reported from day one. Stupid, if you ask me.

If this persists with 2.1, you can be sure that I’ll be making a lot more noise about it.

Have you tried the S1070 driver? (I know, it seems weird.) The bug is marked as fixed in our database, and the fix should be in the S1070 driver (177.70.18 or whatever), but for whatever reason it’s apparently not in 177.73 or 177.80 as far as I can tell. I’ve been told that the 2.1 drivers will definitely contain the fix, though.

Odd, it doesn’t show as fixed in my bug view; maybe it is tagged that way in the internal system. Also, browsing through, I noticed at least one other duplicate (though filed against 177.73, whereas the bug I posted mentioned the previous driver version).

Thanks for the info on 177.70.18. It works like a charm on the D870 I’ve got here.

Yours is marked as a duplicate of the S870 bug, and I don’t know why no one told you that… anyway, glad to hear that it works on a D870. That fix is definitely in the 2.1 beta driver.

I’ve just discovered this problem with my Tesla S870 and the 177.82 driver. Two of the GPUs are attached to one IBM x3755 and the other two to a different x3755. One x3755 also has a Quadro FX 5800 and the other a 5600. In each x3755, one of the Tesla GPUs clocks at 1.35 GHz and the other at 1.19 GHz. The slower-clocked GPU reports a device-to-device bandwidth around half what it should be (only ~30 GB/s instead of ~60).
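
For what it’s worth, the numbers are at least consistent with the memory clock being halved, not just the shader clock. If I have the specs right, the C870 has 800 MHz GDDR3 on a 384-bit bus, for a theoretical peak of 800 MHz × 2 (DDR) × 48 bytes ≈ 76.8 GB/s. The SDK test counts each byte of a device-to-device copy twice (once read, once written), so ~60 GB/s is about what a healthy board should report, and a board running its memory at half speed would land right around the ~30 GB/s we’re all seeing.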

When will there be a fix?

Both 177.70.18 and 180.22 contain the bandwidth fix. (Well, 177.70.18 definitely does; I haven’t tried 180.22, but The Powers That Be tell me that it does.)

Thanks. 180.22 didn’t work, but 177.70.18 does.

Thanks for that; I am making inquiries.