There is something fishy about your memcpy numbers. They are much higher than what I get when I run simpleStreams:
$ ./simpleStreams
running on: GeForce GTX 280
memcopy:        17.79
kernel:         17.21
non-streamed:   33.93 (35.00 expected)
4 streams:      18.71 (21.65 expected with compute capability 1.1 or later)
This was running in linux64, by the way.
Does bandwidthTest show any issues with host<->device bandwidth on your system?
Well, there is your problem. You’re only getting ~400 MiB/s to/from the device. You should be getting ~4 GiB/s. Unfortunately, these issues are very hard to debug:
Try turning off Aero and any processes that might be using the graphics card heavily
The standard NVIDIA answer is to make sure you are running the latest BIOS
Try running the latest driver. tmurray posted it to the Windows XP forums a few days ago.
The hardware may be the reason … it is a relatively old ASUS motherboard with the nForce 4 SLI chipset. Since I have two GPUs installed, it shares its 16 PCI-E lanes between them, providing only 8 to each GPU.
Oh, I guess 178.08 is the one from the XP forum. Another sign of how awake I am. I haven’t booted into XP for months anyway, so I don’t keep up on these things.
It could. Still, 8x should only slow things down by a factor of two, assuming the BIOS is smart enough to handle the situation optimally. Since it is an old board, it is PCIe v1, correct? What are the results from bandwidthTest --memory=pinned? That should reach at least ~1.5 GiB/s even in your configuration.
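As a sanity check on the factor-of-two claim, the theoretical PCIe v1 numbers work out like this (a rough sketch of the link math; real sustained bandwidth is lower due to protocol overhead, and the function name is just for illustration):

```python
# Rough theoretical PCIe v1.x payload bandwidth: 2.5 GT/s per lane with
# 8b/10b encoding leaves 2 Gbit/s of payload per lane per direction,
# i.e. 250 MB/s per lane.
PER_LANE_MB_S = 2.5e9 * (8 / 10) / 8 / 1e6  # 250 MB/s

def pcie_v1_bandwidth_mb_s(lanes):
    """Theoretical one-direction payload bandwidth for a PCIe v1 link."""
    return lanes * PER_LANE_MB_S

print(pcie_v1_bandwidth_mb_s(16))  # x16 slot: 4000.0 MB/s
print(pcie_v1_bandwidth_mb_s(8))   # x8 slot:  2000.0 MB/s
```

So dropping from x16 to x8 halves the theoretical ceiling to ~2 GB/s, which is why ~400 MiB/s points at a problem beyond the slot configuration.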
The estimate for the streamed performance assumes that the memcopy is shorter than the kernel: the estimated time is the kernel time, plus the memcopy time divided by the number of streams.
In your case, you do have a strange issue with the observed PCIe bandwidth, making the memcopy much longer. So the correct estimate in your case is the full memcopy time, plus the kernel time divided by the number of streams (it is always the shorter operation that gets partially “hidden” by stream overlap).
With this math, streaming works in your case as well as theoretically possible.
I hope whatever you want to do with CUDA involves lots and lots of calculation without any data transfer GPU<->CPU.
I get ca. 1.2 GB/s “normal” and 3.1 GB/s pinned, and I had to move more of the algorithm to the GPU than I had wanted just to avoid copies (these parts are slower and more complex on the GPU but still faster than copying the data).