Does cudaMemcpyAsync decrease data transfer performance?

Hi all,

In my project, I want to overlap data transfers with kernel launches to boost the application's performance.

But when I use cudaMemcpyAsync with a stream ID other than 0, the data transfer between host and device gets slower.

Here is my source code:

[codebox]
for (int offset = 0; offset < iqSize; offset += fftSize*nStream)
{
	// host -> device: one asynchronous copy per stream
	for (int j = 0; j < nStream; j++)
		CUDA_SAFE_CALL(cudaMemcpyAsync(d_iq[j], iq, sizeof(Complex)*fftSize, cudaMemcpyHostToDevice, stream[j]));

	// device -> host: one asynchronous copy per stream
	for (int j = 0; j < nStream; j++)
		CUDA_SAFE_CALL(cudaMemcpyAsync(spectrum, d_spectrum[j], sizeof(Complex)*fftSize, cudaMemcpyDeviceToHost, stream[j]));
}
[/codebox]


This takes almost 10 ms, but when I replace stream[j] with 0, it takes only 8 ms.
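
One thing that may be worth checking here (the snippet above does not show how `iq` and `spectrum` are allocated): cudaMemcpyAsync can only run asynchronously with respect to the host when the host buffer is page-locked, for example allocated with cudaMallocHost. Below is a minimal sketch of how the two cases could be timed side by side under that assumption. The helper name `timeCopies`, the `CHECK` macro, the `float2` definition of `Complex`, and the use of a separate host slice per stream are all assumptions for illustration, not taken from the code above.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

typedef float2 Complex;  // assumption: the post does not show how Complex is defined

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical helper: issues nStream host-to-device copies followed by
// nStream device-to-host copies of fftSize elements each.
// useStreams == false sends every copy to stream 0 instead.
// Returns the elapsed wall-clock time in ms, or -1 if the data read back is wrong.
float timeCopies(int nStream, size_t fftSize, bool useStreams)
{
    const size_t bytes = sizeof(Complex) * fftSize;

    // Page-locked (pinned) host buffers, one slice per stream.
    Complex *h_in = NULL, *h_out = NULL;
    CHECK(cudaMallocHost((void **)&h_in, bytes * nStream));
    CHECK(cudaMallocHost((void **)&h_out, bytes * nStream));
    for (size_t i = 0; i < fftSize * nStream; ++i) {
        h_in[i] = make_float2((float)i, 0.0f);
        h_out[i] = make_float2(-1.0f, -1.0f);
    }

    std::vector<Complex *> d_buf(nStream);
    std::vector<cudaStream_t> stream(nStream);
    for (int j = 0; j < nStream; ++j) {
        CHECK(cudaMalloc((void **)&d_buf[j], bytes));
        CHECK(cudaStreamCreate(&stream[j]));
    }

    // Make sure the device is idle, then time everything up to the final sync.
    CHECK(cudaDeviceSynchronize());
    auto t0 = std::chrono::steady_clock::now();
    for (int j = 0; j < nStream; ++j) {
        cudaStream_t s = useStreams ? stream[j] : 0;
        CHECK(cudaMemcpyAsync(d_buf[j], h_in + j * fftSize, bytes,
                              cudaMemcpyHostToDevice, s));
    }
    for (int j = 0; j < nStream; ++j) {
        cudaStream_t s = useStreams ? stream[j] : 0;
        CHECK(cudaMemcpyAsync(h_out + j * fftSize, d_buf[j], bytes,
                              cudaMemcpyDeviceToHost, s));
    }
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();
    float ms = std::chrono::duration<float, std::milli>(t1 - t0).count();

    // Check the round trip so the timing refers to copies that really happened.
    bool ok = true;
    for (size_t i = 0; i < fftSize * nStream; ++i)
        if (h_out[i].x != h_in[i].x || h_out[i].y != h_in[i].y) ok = false;

    for (int j = 0; j < nStream; ++j) {
        CHECK(cudaStreamDestroy(stream[j]));
        CHECK(cudaFree(d_buf[j]));
    }
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return ok ? ms : -1.0f;
}
```

Calling `timeCopies(nStream, fftSize, true)` and `timeCopies(nStream, fftSize, false)` a few times each, and discarding the first call as warm-up, would give a rough comparison. Note that with nothing but copies in flight there may be little to overlap, since copies in the same direction generally share a copy engine, so a small extra cost from the streams would not necessarily be surprising.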