I am seeing a major performance hit when doing asynchronous memcpy. It is easy to reproduce with the simpleStreams program: the streams behave as if they were running serially rather than in parallel.
OS: Fedora Core 8
MB: ASUS P5N-T 780i motherboard
GPUs: GeForce 9800 GT, Tesla C1060
CUDA: 2.0
Driver: 180.22
Output from simpleStreams:
running on: Tesla C1060
memcopy: 48.44
kernel: 53.73
non-streamed: 45.35 (102.16 expected)
4 streams: 266.74 (65.84 expected with compute capability 1.1 or later)
If the results were merely a little off from the expected values I wouldn’t complain so much, but compare them with my laptop, which has a GeForce 8700M GT:
[b]memcopy: 42.67
kernel: 54.50
non-streamed: 123.37 (97.17 expected)
4 streams: 62.76 (65.17 expected with compute capability 1.1 or later)
[/b]
I would expect the Tesla to give measurements close to the expected values, as the laptop does, and I certainly wouldn’t expect the non-streamed version to outperform the multi-stream version.
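For anyone reading the numbers above, the "expected" figures appear to follow a simple model: the non-streamed estimate is memcopy + kernel, and the streamed estimate is kernel + memcopy / number of streams. This is inferred from the output quoted in this post and matches it to within rounding. The sketch below just redoes that arithmetic in Python; the function name `expected_times` is made up for illustration and is not part of the SDK sample.

```python
def expected_times(memcopy_ms, kernel_ms, nstreams=4):
    """Estimate (non_streamed, streamed) times in ms.

    Assumed model, inferred from the simpleStreams output above:
      non-streamed = memcopy + kernel          (copy and kernel run back to back)
      streamed     = kernel + memcopy/nstreams (only one chunk's copy is not hidden)
    """
    non_streamed = memcopy_ms + kernel_ms
    streamed = kernel_ms + memcopy_ms / nstreams
    return non_streamed, streamed


# (memcopy, kernel, measured non-streamed, measured 4 streams), copied from the post
runs = {
    "Tesla C1060": (48.44, 53.73, 45.35, 266.74),
    "GeForce 8700M GT": (42.67, 54.50, 123.37, 62.76),
}

for name, (memcopy, kernel, measured_serial, measured_streamed) in runs.items():
    serial, streamed = expected_times(memcopy, kernel)
    ratio = measured_streamed / streamed
    print(f"{name}: expected {serial:.2f} / {streamed:.2f} ms, "
          f"measured 4-stream time is {ratio:.2f}x the estimate")
```

Under this model the laptop's 4-stream time sits slightly below the estimate, while the Tesla's is roughly four times above it, which is what prompted this post.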
We have determined the cause of this problem: a custom PCIe card in the system was somehow causing trouble in the PCIe switch. After removing the card, we get the transfer rates we expect:
[b]running on: Tesla C1060
memcopy: 20.31
kernel: 25.14
non-streamed: 45.33 (45.45 expected)
4 streams: 28.14 (30.22 expected with compute capability 1.1 or later)