Host <-> device bandwidth problems: slow and inconsistent bandwidth on Linux

I built a new Linux box running CentOS (based on RHEL 4.5). All the CUDA examples work fine, as does our CUDA application, which we recently ported from Windows.

However, I am seeing a problem with memory bandwidth between the G80 and the CPU.

The system has an older (three-year-old) motherboard with a PCI Express x16 slot and an Intel 945 chipset, with an 8800 GTS installed, and I would expect better bandwidth results than I am getting.

I am seeing 2 issues:

  1. Speed. I expect roughly twice the performance of the numbers I am seeing; I get about twice the bandwidth under Windows on the same machine.
  2. Consistency. For transfer sizes under 700 KB the results vary greatly, i.e. notice below how the bandwidth goes up and down even as the transfer size increases. (A minimal standalone measurement sketch follows this list.)
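
For reference, here is a rough standalone sketch of the kind of pinned host-to-device sweep I mean. It is my own quick check, not the SDK's bandwidthTest, and the sizes, iteration count, and timing approach are arbitrary:

// bw_sweep.cu - rough pinned host-to-device bandwidth sweep over increasing sizes.
// My own quick check only, not the SDK bandwidthTest; sizes, iteration count,
// and the MB/s conversion are arbitrary choices.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int iters = 10;
    for (size_t bytes = 1024; bytes <= (64 << 20); bytes *= 2) {
        void *h = 0, *d = 0;
        cudaMallocHost(&h, bytes);   // pinned (page-locked) host buffer
        cudaMalloc(&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // warm-up copy

        cudaEventRecord(start, 0);
        for (int i = 0; i < iters; ++i)
            cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%10lu bytes  %8.1f MB/s\n", (unsigned long)bytes,
               (double)bytes * iters / (ms / 1000.0) / (1 << 20));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        cudaFreeHost(h);
    }
    return 0;
}

Compile with nvcc (e.g. nvcc -o bw_sweep bw_sweep.cu) and run it with and without X to see whether the variation tracks what the SDK test reports.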

When I run the bandwidthTest example, I get the following results:

./bandwidthTest --memory=pinned --mode=shmoo
Shmoo Mode
Host to Device Bandwidth for Pinned memory

Transfer Size (Bytes) Bandwidth(MB/s)
1024 32.7
2048 66.0
3072 97.3
4096 79.9
5120 158.5
6144 184.8
7168 215.0
8192 154.7
9216 267.1
10240 286.4
11264 310.5
12288 128.2
13312 358.6
14336 379.8
15360 264.9
16384 420.0
17408 442.7
18432 450.7
19456 323.8
20480 489.5
22528 371.7
24576 150.9
26624 116.2
28672 192.0
30720 461.4
32768 484.5
34816 708.0
36864 743.3
38912 752.7
40960 556.4
43008 802.7
45056 836.0
47104 621.3
49152 866.5
51200 904.2
61440 996.5
71680 269.6
81920 303.5
92160 390.3
102400 826.2
204800 1041.7
307200 1209.1
409600 1537.9
512000 1696.0
614400 1642.7
716800 1740.3
819200 1774.4
921600 1793.7
1024000 1742.3

67186688 1902.0

Shmoo Mode
Device to Host Bandwidth for Pinned memory

Transfer Size (Bytes) Bandwidth(MB/s)
1024 38.3
2048 74.0
3072 109.7
4096 143.1
5120 176.9
6144 85.7
7168 239.9
8192 263.0
9216 291.0
10240 317.1
11264 346.5
12288 361.7
13312 381.2
14336 400.9
15360 442.6
16384 463.6
17408 489.7
18432 509.5
19456 530.1
20480 551.7
22528 585.4
24576 620.0
26624 651.0
28672 687.0
30720 712.8
32768 742.3
34816 772.2
36864 791.8
38912 90.0
40960 157.6
43008 878.3
45056 900.8
47104 920.5
49152 935.6
51200 955.5
61440 1033.4

62992384 1758.7
67186688 1758.6

Any ideas? Anyone else seeing these problems?

Hi jesser,

I’m afraid I won’t be of much help, but I am having what appears to be the same problem. My configuration:

Dell Precision 370 Mini-Tower

2 GB RAM

CentOS 4.5

Latest drivers and tools (v1.1 Toolkit and SDK, v169.04 driver)

Kernel: 2.6.9-55

GPU: 8800 GTX

I have confirmed that my card is in the machine's only PCI-E slot. From the dmesg and lspci -vvv output, I can confirm that it is (or at least that the OS believes it is) operating in x16 mode.

When I run the bandwidth test in non-pinned (pageable) mode, I obtain:

gpu-server1:~/sdk/bin/linux/release> bandwidthTest

Quick Mode

Host to Device Bandwidth for Pageable memory

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432               628.4

Quick Mode

Device to Host Bandwidth for Pageable memory

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432               787.8

Quick Mode

Device to Device Bandwidth

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432               70469.1

&&&& Test PASSED

And when I run in pinned (--memory=pinned) mode, I obtain:

gpu-server1:~/sdk/bin/linux/release> bandwidthTest --memory=pinned

Quick Mode

Host to Device Bandwidth for Pinned memory

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432               708.2

Quick Mode

Device to Host Bandwidth for Pinned memory

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432               850.1

Quick Mode

Device to Device Bandwidth

.

Transfer Size (Bytes)   Bandwidth(MB/s)

 33554432               70438.0

&&&& Test PASSED

…which represents only a fractional increase in speed. From other posts in these forums, I expected a 2-3x improvement between the pageable and pinned bandwidths (somewhere in the range of 2-3 GB/s between host and device). 700 MB/s clearly (?) points to some sort of device or configuration failure.
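
In case it helps anyone reproduce this outside the SDK, here is a minimal sketch along those lines. It is my own rough check, not the SDK bandwidthTest; the 32 MB size just mirrors Quick Mode, and timing a single copy is a simplification:

// pinned_vs_pageable.cu - times one pageable (malloc) copy against one pinned
// (cudaMallocHost) copy of the same 32 MB buffer. A rough sanity check only,
// not the SDK bandwidthTest; the single-copy timing is a simplification.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static double copy_mbps(void *host, void *dev, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // warm-up copy
    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (double)bytes / (ms / 1000.0) / (1 << 20);
}

int main() {
    const size_t bytes = 32 << 20;   // 32 MB, same size Quick Mode uses
    void *dev = 0, *pageable = 0, *pinned = 0;
    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);        // ordinary pageable host allocation
    cudaMallocHost(&pinned, bytes);  // page-locked host allocation

    printf("pageable host-to-device: %8.1f MB/s\n", copy_mbps(pageable, dev, bytes));
    printf("pinned   host-to-device: %8.1f MB/s\n", copy_mbps(pinned, dev, bytes));

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}

On a healthy x16 link I would expect the pinned number to land in the 2-3 GB/s range mentioned above, which is what makes the ~700-850 MB/s results look suspicious.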

Does anyone have any configuration pointers on this issue? Since many of the people in these forums are successfully using the cards with the same NVIDIA drivers and SDK tools, I suspect that the problem must lie somewhere in the OS configuration (perhaps a kernel configuration parameter that is not set?).

Any help would be appreciated.

Please generate and attach an nvidia-bug-report.log.

Hi netllama,

Bug report is attached - thanks for your quick response!
nvidia_bug_report.log.txt (114 KB)

I’m not seeing anything that would suggest a CUDA bug.

Have you verified that you’re using the latest motherboard BIOS?
Do the results differ if X is not running?

Hi netllama,

The BIOS is the most current, and all packages are up-to-date. I just ran the bandwidth test again after exiting out of X, and still got the same results.

Any other ideas?

jesser - have you made any progress?

Thanks!

No.

I have tried running without X, and that actually seems to fix issue #2: I no longer see the inconsistency in the transfer rates that I see when X is running.

The speeds are still very slow, and I have attached my bug report file.
nvidia_bug_report.log.txt (94.4 KB)

NVIDIA, any progress on this? What should I do to troubleshoot?

As I stated earlier, I don’t see anything here to suggest a CUDA bug. Have you run non-CUDA applications which do not exhibit this problem?

Hi netllama, jesser -

I still don’t have any resolution on my end, but I saw your recent posts and wanted to chime in. The trouble (at least for me) is that I have no way to test the PCIE bandwidth other than using the GPU’s bandwidth test. At this point, pending any other insight, I’m resigned to assuming that the problem is a hardware problem at the motherboard level. I’m a little surprised since it’s a Dell Precision 370 Workstation, and I would have thought that this would have been a fairly well documented problem if it were a design flaw, so perhaps it is a degraded failure mode or something. I have heard it stated that even though the motherboard may claim PCIE x16 (and indeed, the GPU believes it is getting x16), in fact it is less - perhaps only x4. We have tested on-chip performance of our card, and it belts out calculations at full-speed - so the problem is definitely in the PCIE transfer speed. A hardware motherboard problem would explain my slower transfer rates to and from the card. As I indicated earlier, I have no way to independently (without the card in) test my PCIE bus bandwidth to confirm this hypothesis. If anyone has any suggestions, please let me know.