simpleStreams and async memory copies are slow

I am seeing a major performance hit when doing asynchronous memcpy. It is easy to reproduce by running the simpleStreams program: it looks as if the streams are running serially rather than in parallel.

OS: Fedora Core 8

MB: ASUS P5N-T 780i motherboard

GPUs: GeForce 9800 GT, Tesla C1060

CUDA: 2.0

Driver: 180.22

Output from simpleStreams:

running on: Tesla C1060

memcopy: 48.44

kernel: 53.73

non-streamed: 45.35 (102.16 expected)

4 streams: 266.74 (65.84 expected with compute capability 1.1 or later)


Test PASSED
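For anyone reading the numbers above: the "expected" figures appear to be derived from the two measured times, with the non-streamed estimate being memcopy plus kernel, and the streamed estimate being kernel plus memcopy divided by the number of streams. That is an inference from the printed output rather than something stated in this thread, and the function names below are made up for illustration. A minimal Python sketch of the arithmetic:

```python
# Hypothetical helpers: the formulas are inferred from the printed
# simpleStreams figures, not taken from documentation.

def expected_non_streamed(memcopy, kernel):
    # With no overlap, the copy and the kernel run back to back.
    return memcopy + kernel

def expected_streamed(memcopy, kernel, n_streams):
    # With ideal overlap, only one stream's share of the copy
    # is not hidden behind kernel execution.
    return kernel + memcopy / n_streams

# Times printed for the Tesla C1060 run above.
memcopy, kernel = 48.44, 53.73

serial = expected_non_streamed(memcopy, kernel)
overlapped = expected_streamed(memcopy, kernel, 4)
print(f"non-streamed expected: {serial:.2f}")
print(f"4 streams expected:    {overlapped:.2f}")
```

The non-streamed estimate computed this way can differ from the printed 102.16 in the last digit, presumably because the program works from unrounded timings.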

Any ideas as to what I can try to resolve this?

For what it's worth, this is a similar thread:

http://forums.nvidia.com/index.php?showtop…rt=#entry492314

Thanks for the link, though I'm not sure it helps =(.

If I were that close to the expected value, I wouldn’t complain so much, but compare the results from my laptop, which has a GeForce 8700M GT:

memcopy: 42.67

kernel: 54.50

non-streamed: 123.37 (97.17 expected)

4 streams: 62.76 (65.17 expected with compute capability 1.1 or later)

I would expect something close to the expected results on the Tesla as well, and I certainly wouldn’t expect the non-streamed version to outperform the multi-stream version.

We have determined the cause of this problem: a custom PCIe card in the system is somehow causing trouble in the PCIe switch. After removing the card, we get the transfer rates we expect:

running on: Tesla C1060

memcopy: 20.31

kernel: 25.14

non-streamed: 45.33 (45.45 expected)

4 streams: 28.14 (30.22 expected with compute capability 1.1 or later)


Test PASSED
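To put the three runs in this thread side by side, here is a rough Python sketch that divides each measured 4-stream time by the expected figure printed next to it. The labels and the `slowdown` helper are hypothetical; the numbers are copied from the outputs posted above.

```python
# (measured, expected) for the "4 streams" line of each run in this thread.
runs = {
    "Tesla C1060, custom PCIe card installed": (266.74, 65.84),
    "GeForce 8700M GT laptop": (62.76, 65.17),
    "Tesla C1060, custom PCIe card removed": (28.14, 30.22),
}

def slowdown(measured, expected):
    # Ratio above 1.0 means slower than the program's own estimate.
    return measured / expected

for label, (measured, expected) in runs.items():
    print(f"{label}: {slowdown(measured, expected):.2f}x of expected")
```

A ratio near or below 1 suggests the copies are overlapping with the kernels about as well as the program estimates; the run with the custom card installed comes out at roughly four times the estimate.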

I’ve also encountered a similar problem.

I’ve been running code on a GTX280 and using a Quadro FX570 to run the computer display. In this configuration the streamed code was vastly slower.

However, once I switched my desktop display back to the GTX280, the streamed version ran as expected.

Would anyone like to offer a more detailed explanation? Surely this needs a fix.