cudaMemcpyAsync slower than cudaMemcpy?

Hi,

I wrote a simple program that uses streams, similar to the simpleStreams example but with different input and output. When I use cudaMemcpyAsync, the copies take twice as long as the same copy operations done with cudaMemcpy.

Does anyone know why this happens? Is it common?

Regards

up