simpleStreams sample shows almost no speedup. Is this expected?

When running simpleStreams as a test the following results are produced:

running on: GeForce GTX 280
memcopy: 157.78
kernel: 17.08
non-streamed: 173.38 (174.86 expected)
4 streams: 159.47 (56.52 expected with compute capability 1.1 or later)


The GPU has compute capability 1.3, the motherboard is old (ASUS, nForce 4 SLI chipset), the CPU is an Athlon X2 4800+, and the GPU driver version is 178.08.

It seems the streams are not doing their job as expected. What is the reason for this behaviour?

If you are running this on Vista, note that streams are not supported there.

No, it’s XP.

There is something fishy about your memcpy numbers. They are much higher than what I get when I run simpleStreams:

$ ./simpleStreams
running on: GeForce GTX 280
memcopy: 17.79
kernel: 17.21
non-streamed: 33.93 (35.00 expected)
4 streams: 18.71 (21.65 expected with compute capability 1.1 or later)

This was running in linux64, by the way.

Does bandwidthTest show any issues with host<->device bandwidth on your system?

This is what bandwidth test reports:


Running on…

  device 0:GeForce GTX 280

Quick Mode

Host to Device Bandwidth for Pageable memory.

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 371.2

Quick Mode

Device to Host Bandwidth for Pageable memory.

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 404.3

Quick Mode

Device to Device Bandwidth.

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 114740.5

&&&& Test PASSED

Press ENTER to exit…

Well, there is your problem. You’re only getting ~400 MiB/s to/from the device. You should be getting ~4 GiB/s. Unfortunately, these issues are very hard to debug:

  1. Try turning off Aero and any processes that might be using the graphics card heavily
  2. The standard NVIDIA answer is to make sure you are running the latest BIOS
  3. Try running the latest driver. tmurray posted it to the Windows XP forums a few days ago.

It is XP, no Aero at all :-)

Driver is 178.08, the latest one.

The hardware may be the reason: it is a relatively old ASUS motherboard with the nForce 4 SLI chipset, and since I have two GPUs installed, the 16 PCI-E lanes are shared between them, giving only 8 to each GPU.

Could this slow things down so significantly?

Well, I am awake today, aren’t I.

Oh, I guess 178.08 is the one from the XP forum. Another example that I’m awake. I haven’t booted into XP for months anyways so I don’t keep up on these things.

It could. Still, x8 should only slow things down by a factor of two, assuming the BIOS is smart enough to handle the situation optimally. Since it is an old board, it is PCIe v1, correct? What are the results from bandwidthTest --memory=pinned? That should get at least ~1.5 GiB/s even in your configuration.

Yeah, about 1.5 GB/s … so the bottleneck seems to be found.

I think that for some reason GT200 performs better with four streams than eight; give that a shot.

(also your bandwidth test numbers are super-lame :( )

The estimate for the streamed performance assumes that the memcopy is shorter than the kernel: the estimated time is the kernel time plus (the memcopy time divided by the number of streams).

In your case, you do have a strange issue with the observed PCIe bandwidth, which makes the memcopy much longer than the kernel. So the correct estimate in your case is the full memcopy time plus (the kernel time divided by the number of streams). It is always the shorter operation that gets portions of it “hidden” by stream overlap.

With this math, streaming works in your case as well as theoretically possible.


I hope whatever you want to do with CUDA involves lots and lots of calculation without any data transfer GPU<->CPU.

I get ca. 1.2 GB/s “normal” and 3.1 GB/s pinned, and I had to move more of the algorithm to the GPU than I had wanted just to avoid copies (these parts are slower and more complex on the GPU but still faster than copying the data).