Unexpected results from the simpleStreams sample

My output for the simpleStreams sample in the CUDA SDK is:

[ simpleStreams ]

> Device name : GeForce 8800 GT

> CUDA Capable SM 1.1 hardware with 14 multi-processors

> scale_factor = 1.0000

> array_size   = 16777216

memcopy:		59.38

kernel:		 47.43

non-streamed:   63.72 (106.81 expected)

4 streams:	  55.06 (62.27 expected with compute capability 1.1 or later)

-------------------------------

Test PASSED

Press ENTER to exit...

What's up with that? My non-streamed timing (63.72) is much less than the sum of the memcopy and kernel times (106.81), only about 60% of the expected value. Has anyone seen similar results?

Dietger

Well, I can’t help you, but I have unexpected results too:

[ simpleStreams ]

> Device name : GeForce 8800 GTS 512

> CUDA Capable SM 1.1 hardware with 16 multi-processors

> scale_factor = 1.0000

> array_size   = 16777216

memcopy:	262.42

kernel:		39.19

non-streamed:	298.63 (301.61 expected)

4 streams:	265.29 (104.79 expected with compute capability 1.1 or later)

Does anyone know what’s going on? I’m using Ubuntu 9.10, kernel 2.6.31-19, and the 195.30 driver.

No, I don’t know, although the common theme in both of your problems is a G92-based GPU. On my GTX 275, it seems to work as expected (Ubuntu 9.04 x86_64, kernel 2.6.28-17-generic, 190.53 drivers):

[ simpleStreams ]

> Device name : GeForce GTX 275

> CUDA Capable SM 1.3 hardware with 30 multi-processors

> scale_factor = 1.0000

> array_size   = 16777216

memcopy:	13.12

kernel:		16.62

non-streamed:	28.12 (29.74 expected)

4 streams:	19.04 (19.90 expected with compute capability 1.1 or later)

-------------------------------

Test PASSED

Press ENTER to exit...
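
For anyone comparing numbers: the "expected" figures in all three logs above look consistent with a simple model, where the non-streamed estimate is memcopy + kernel and the streamed estimate is kernel + memcopy / nstreams (the copy is split across the streams and overlapped with the kernel). That model is inferred from the printed values, not checked against the sample source, so treat it as an assumption. The function name `expected_times` is made up for this sketch.

```python
# Rough sketch: reproduce the "expected" columns from the measured
# memcopy and kernel times posted in this thread.
# Assumption: expected non-streamed = memcopy + kernel,
#             expected streamed     = kernel + memcopy / nstreams.

def expected_times(memcopy_ms, kernel_ms, nstreams=4):
    """Return (non_streamed, streamed) estimates in milliseconds."""
    non_streamed = memcopy_ms + kernel_ms
    streamed = kernel_ms + memcopy_ms / nstreams
    return non_streamed, streamed

# (memcopy, kernel) pairs copied from the three logs above.
runs = {
    "GeForce 8800 GT": (59.38, 47.43),
    "GeForce 8800 GTS 512": (262.42, 39.19),
    "GeForce GTX 275": (13.12, 16.62),
}

for name, (memcopy_ms, kernel_ms) in runs.items():
    non_streamed, streamed = expected_times(memcopy_ms, kernel_ms)
    print(f"{name}: non-streamed {non_streamed:.2f}, 4 streams {streamed:.2f}")
```

If that model is right, the estimates are only as good as the separately measured memcopy and kernel times they are built from, so an odd "expected" value may point at those two measurements as much as at the streamed run.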