cudaHostAlloc memory initialization time

Hello everyone! I know that streams and asynchronous commands can achieve better performance. I am running my code on a TX2, and a sample of the code is below. The problem is that most of the time is spent in step 3. When I use cudaMemcpy directly, the data copy takes 20~40 ms, but now that I use cudaMemcpyAsync, step 3 takes almost 40~60 ms. What can I do to solve this problem?

cudaHostAlloc((void **)&h_a, sizeof(float)*N, cudaHostAllocDefault);
cudaMalloc((void **)&d_a, sizeof(float)*N);
for(int i=0; i<N; i++)
        h_a[i] = src_a[i];

cudaMemcpyAsync(d_a, h_a, sizeof(float)*N/nstreams, cudaMemcpyHostToDevice, stream[i]);
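
To narrow down where the time goes, the snippet above could be wrapped in something like the following sketch, which times the host-side fill of the pinned buffer separately from the chunked asynchronous copies. This is only an illustration, not my exact code: the helper name timed_chunked_copy, the CHECK macro, the readback parameter, and the use of a host clock with cudaDeviceSynchronize are all assumptions, and it assumes nstreams divides N evenly.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical error-check macro: prints the CUDA error and returns -1.
// (A sketch only: early returns do not free earlier allocations.)
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return -1;                                                \
        }                                                             \
    } while (0)

// Hypothetical helper: fills a pinned buffer from src_a, copies it to the
// device in nstreams chunks, and reports both phases in milliseconds.
// readback receives the device contents so the caller can verify them.
int timed_chunked_copy(const float *src_a, size_t n, int nstreams,
                       double *fill_ms, double *copy_ms, float *readback)
{
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;
    assert(nstreams > 0 && n % nstreams == 0);  // assumption: even split

    float *h_a = nullptr, *d_a = nullptr;
    std::vector<cudaStream_t> stream(nstreams);
    CHECK(cudaHostAlloc((void **)&h_a, sizeof(float) * n, cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&d_a, sizeof(float) * n));
    for (int i = 0; i < nstreams; i++)
        CHECK(cudaStreamCreate(&stream[i]));

    // Phase 1: host-side fill of the pinned buffer (the for loop in the post).
    auto t0 = Clock::now();
    for (size_t i = 0; i < n; i++)
        h_a[i] = src_a[i];
    auto t1 = Clock::now();

    // Phase 2: one asynchronous host-to-device copy per stream, each at its
    // own offset, followed by a full synchronization before stopping the clock.
    size_t chunk = n / nstreams;
    for (int i = 0; i < nstreams; i++)
        CHECK(cudaMemcpyAsync(d_a + i * chunk, h_a + i * chunk,
                              sizeof(float) * chunk,
                              cudaMemcpyHostToDevice, stream[i]));
    CHECK(cudaDeviceSynchronize());
    auto t2 = Clock::now();

    *fill_ms = Ms(t1 - t0).count();
    *copy_ms = Ms(t2 - t1).count();

    // Copy back so the caller can check that every chunk arrived.
    CHECK(cudaMemcpy(readback, d_a, sizeof(float) * n, cudaMemcpyDeviceToHost));

    for (int i = 0; i < nstreams; i++)
        CHECK(cudaStreamDestroy(stream[i]));
    CHECK(cudaFree(d_a));
    CHECK(cudaFreeHost(h_a));
    return 0;
}
```

Calling it with the same src_a and N and printing fill_ms and copy_ms should show whether the extra time is in the loop that writes into h_a or in the copies themselves.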