My output for the simpleStreams sample in the CUDA SDK is:
[ simpleStreams ]
> Device name : GeForce 8800 GT
> CUDA Capable SM 1.1 hardware with 14 multi-processors
> scale_factor = 1.0000
> array_size = 16777216
memcopy: 59.38
kernel: 47.43
non-streamed: 63.72 (106.81 expected)
4 streams: 55.06 (62.27 expected with compute capability 1.1 or later)
-------------------------------
Test PASSED
Press ENTER to exit...
What's up with that? My non-streamed timing is much less than the sum of the memcopy and kernel timings, and only about 60% of the expected value (63.72 vs. 106.81). Has anyone seen similar results?
No, and I don't know why, although the common theme in both of your problems is G92-based GPUs. On my GTX 275 it works as expected (Ubuntu 9.04 x86_64, kernel 2.6.28-17-generic, 190.53 drivers):
[ simpleStreams ]
> Device name : GeForce GTX 275
> CUDA Capable SM 1.3 hardware with 30 multi-processors
> scale_factor = 1.0000
> array_size = 16777216
memcopy: 13.12
kernel: 16.62
non-streamed: 28.12 (29.74 expected)
4 streams: 19.04 (19.90 expected with compute capability 1.1 or later)
-------------------------------
Test PASSED
Press ENTER to exit...
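To put the two runs side by side, here is a small sketch that recomputes the "expected" figures from the memcopy and kernel timings printed above and compares them with the measured ones. The formulas are an assumption inferred from the numbers in both outputs (memcopy + kernel for the non-streamed case, kernel + memcopy / 4 for the 4-stream case), not taken from the sample's source, and the function names are hypothetical.

```python
# Timings copied verbatim from the two simpleStreams outputs in this thread.
runs = {
    "GeForce 8800 GT": {"memcopy": 59.38, "kernel": 47.43,
                        "non_streamed": 63.72, "streamed": 55.06},
    "GeForce GTX 275": {"memcopy": 13.12, "kernel": 16.62,
                        "non_streamed": 28.12, "streamed": 19.04},
}
NSTREAMS = 4  # "4 streams" in both outputs


def expected_non_streamed(memcopy, kernel):
    # Assumed model: copy and kernel run back to back with no overlap.
    return memcopy + kernel


def expected_streamed(memcopy, kernel, nstreams=NSTREAMS):
    # Assumed model: only one stream's share of the copy is not hidden
    # behind kernel execution.
    return kernel + memcopy / nstreams


for name, t in runs.items():
    exp_ns = expected_non_streamed(t["memcopy"], t["kernel"])
    exp_s = expected_streamed(t["memcopy"], t["kernel"])
    print(name)
    print("  non-streamed: measured %.2f, expected %.2f, ratio %.2f"
          % (t["non_streamed"], exp_ns, t["non_streamed"] / exp_ns))
    print("  %d streams:    measured %.2f, expected %.2f, ratio %.2f"
          % (NSTREAMS, t["streamed"], exp_s, t["streamed"] / exp_s))
```

Under these assumed formulas the recomputed values agree with the "expected" numbers in both outputs. That would place the oddity on the 8800 GT in the measured non-streamed time, which is far below the sum of the two separately timed steps, rather than in how the expectation is calculated.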