Overlapping GPU and CPU computation?

I also tried running the simple streams example provided in the SDK. As the attached image shows, something is clearly wrong: the time taken when using streams is practically identical to the time taken without streams. Can anyone please shed some light on what is wrong?

Thanks in advance,

Steven
Attachment: simplestream.png