Hi,
I wrote a simple program using streams, similar to the simpleStreams example but with different input and output. When I use cudaMemcpyAsync, the copies take about twice as long as the same copies done with cudaMemcpy.
Does anyone know why this happens? Is it common?
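A minimal sketch of this kind of comparison is below. It is not the actual program: the buffer size, the names `time_copy` and `compare_copies`, and the use of page-locked host memory from `cudaMallocHost` are all assumptions made for illustration. Each copy is timed on the host and the asynchronous one is followed by `cudaStreamSynchronize`, so both measurements cover a completed transfer.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                         cudaGetErrorString(err_));                     \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical result type: wall-clock time of each copy in milliseconds.
struct CopyTimes {
    double sync_ms;
    double async_ms;
};

// Times one host-to-device copy of `bytes` bytes on the host clock.
// The async path waits on the stream so the transfer has finished
// before the clock is read.
static double time_copy(void* dst, const void* src, size_t bytes,
                        bool use_async, cudaStream_t stream)
{
    CHECK(cudaDeviceSynchronize());  // no earlier work should be pending
    auto t0 = std::chrono::steady_clock::now();
    if (use_async) {
        CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream));
        CHECK(cudaStreamSynchronize(stream));
    } else {
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Runs the same copy once with cudaMemcpy and once with cudaMemcpyAsync.
CopyTimes compare_copies(size_t bytes)
{
    void* h_buf = nullptr;
    void* d_buf = nullptr;
    cudaStream_t stream;

    // Assumption: page-locked (pinned) host memory, not malloc'ed memory.
    CHECK(cudaMallocHost(&h_buf, bytes));
    CHECK(cudaMalloc(&d_buf, bytes));
    CHECK(cudaStreamCreate(&stream));
    std::memset(h_buf, 0, bytes);

    // Warm-up copy so one-time initialization cost is not measured.
    CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));

    CopyTimes t;
    t.sync_ms  = time_copy(d_buf, h_buf, bytes, false, stream);
    t.async_ms = time_copy(d_buf, h_buf, bytes, true, stream);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));
    return t;
}

int main()
{
    CopyTimes t = compare_copies(64u << 20);  // 64 MiB, an arbitrary size
    std::printf("cudaMemcpy:      %.3f ms\n", t.sync_ms);
    std::printf("cudaMemcpyAsync: %.3f ms\n", t.async_ms);
    return 0;
}
```

Two things in a sketch like this can change the result: whether the host buffer is page-locked (with pageable memory, cudaMemcpyAsync may not actually run asynchronously), and where the timer stops relative to the stream synchronization.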
Regards