There is something fishy about your memcpy numbers. They are much higher than what I get when I run simpleStreams:
$ ./simpleStreams
running on: GeForce GTX 280
memcopy:        17.79
kernel:         17.21
non-streamed:   33.93 (35.00 expected)
4 streams:      18.71 (21.65 expected with compute capability 1.1 or later)
This was running in linux64, by the way.
Does bandwidthTest show any issues with host<->device bandwidth on your system?
Well, there is your problem. You’re only getting ~400 MiB/s to/from the device. You should be getting ~4 GiB/s. Unfortunately, these issues are very hard to debug:
Try turning off Aero and any processes that might be using the graphics card heavily
The standard NVIDIA answer is to make sure you are running the latest BIOS
Try running the latest driver. tmurray posted it to the Windows XP forums a few days ago.
The hardware may be the reason … it is a relatively old ASUS motherboard with the nForce 4 SLI chipset. Since I have two GPUs installed, it shares its 16 PCI-E lanes between them, providing only 8 to each GPU.
Oh, I guess 178.08 is the one from the XP forum. Another sign of how awake I am. I haven’t booted into XP for months anyway, so I don’t keep up on these things.
It could. Still, 8x should only slow things down by a factor of two, assuming the BIOS is smart enough to handle the situation optimally. Since it is an old board, it is PCIe v1, correct? What are the results from bandwidthTest --memory=pinned? That should reach at least ~1.5 GiB/s even in your configuration.
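As a sanity check on the factor-of-two claim, the theoretical PCIe v1 numbers work out like this (a rough sketch of the link math; real sustained bandwidth is lower due to protocol overhead, and the function name is just for illustration):

```python
# Rough theoretical PCIe v1.x payload bandwidth: 2.5 GT/s per lane with
# 8b/10b encoding leaves 2 Gbit/s of payload per lane per direction,
# i.e. 250 MB/s per lane.
PER_LANE_MB_S = 2.5e9 * (8 / 10) / 8 / 1e6  # 250 MB/s

def pcie_v1_bandwidth_mb_s(lanes):
    """Theoretical one-direction payload bandwidth for a PCIe v1 link."""
    return lanes * PER_LANE_MB_S

print(pcie_v1_bandwidth_mb_s(16))  # x16 slot: 4000.0 MB/s
print(pcie_v1_bandwidth_mb_s(8))   # x8 slot:  2000.0 MB/s
```

So dropping from x16 to x8 halves the theoretical ceiling to ~2 GB/s, which is why ~400 MiB/s points at a problem beyond the slot configuration.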
The estimate for the streamed performance assumes that the memcopy is shorter than the kernel: the estimated time is the kernel time, plus the memcopy time divided by the number of streams.
In your case, you do have a strange issue with the observed PCIe bandwidth, making the memcopy much longer. So the correct estimate in your case is the full memcopy time, plus the kernel time divided by the number of streams (it is always the shorter operation that gets partially “hidden” by stream overlap).
With this math, streaming works in your case as well as theoretically possible.
I hope whatever you want to do with CUDA involves lots and lots of calculation without any data transfer GPU<->CPU.
I get ca. 1.2 GB/s “normal” and 3.1 GB/s pinned, and I had to move more of the algorithm to the GPU than I had wanted just to avoid copies (these parts are slower and more complex on the GPU but still faster than copying the data).