stream execution and smem usage

I am using CUDA 2.0 with the newest driver, but I see no speedup when running simpleStreams on either an 8800GT or a 9800GX2. I am on Linux and compiled with “-arch sm_11 -code sm_11” and also with “-arch compute_11 -code compute_11”, but neither made a difference.

./simpleStreams
memcopy: 40.84
kernel: 43.22
non-streamed: 83.99 (84.06 expected)
8 streams: 84.71 (48.32 expected with compute capability 1.1 or later)

Test PASSED
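
For reference, the two "expected" figures in the output above look consistent with a simple model: without overlap the copy and the kernel run back to back, and with N streams only one stream's share of the copy is left exposed while the rest hides behind kernel execution. That model is my assumption from the numbers, not something confirmed from the sample's source. A quick sketch of the arithmetic (variable names are mine):

```python
# Timings in ms, copied from the simpleStreams output above.
memcopy_ms = 40.84
kernel_ms = 43.22
n_streams = 8

# No overlap: the copy and the kernel execute one after the other.
expected_serial = memcopy_ms + kernel_ms

# Assumed overlap model: the kernel time plus one stream's share of the copy;
# the remaining copies are taken to overlap with kernel execution.
expected_streamed = kernel_ms + memcopy_ms / n_streams

print(f"non-streamed: {expected_serial:.2f} ms")
print(f"{n_streams} streams: {expected_streamed:.3f} ms")
```

Under that model the serial estimate reproduces the 84.06 reported, and the streamed estimate comes out at about 48.3, matching the 48.32 reported up to rounding. My measured 84.71 for 8 streams is essentially the serial figure, so it looks like no overlap is happening at all.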

Also, could someone please clarify the verbose ptxas output? What does “Used n+m bytes smem” mean, and in particular what does the “m” part count? Thanks.