The influence of cudaFree() on the parallelism of CUDA streams

The following pseudocode processes a large amount of work using several streams (I have omitted some function parameters):

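cudaStream_t stream[MAXSTREAM];
for (int i = 0; i < n; i++)
{
    // Pinned host buffer, allocated anew on every iteration
    int *h_toDevice;
    cudaMallocHost((void **)&h_toDevice);
    memset(h_toDevice);

    // Device buffer, also allocated anew on every iteration
    int *d_toDevice;
    cudaMalloc(&d_toDevice);

    // Asynchronous copy and kernel launch on one of the MAXSTREAM streams
    cudaMemcpyAsync(d_toDevice, h_toDevice, stream[i % MAXSTREAM]);

    kernelFunc<<<stream[i % MAXSTREAM]>>>();

    // Both buffers are freed inside the loop
    cudaFreeHost(h_toDevice);
    cudaFree(d_toDevice);
}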

The code above doesn't perform as well as expected. However, if I remove the calls to cudaFreeHost() and cudaFree(), the total time drops from 100 seconds to less than 10 seconds (MAXSTREAM is 10).
I suspected that the two functions affect stream parallelism, so I set MAXSTREAM to 1 to remove all parallelism from the process. The two cases then take almost the same time: about 100 seconds, with or without cudaFree() and cudaFreeHost().
So do these two functions really affect the parallelism of CUDA streams? Could anyone help me? Thanks!

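Yes, they do. Both calls are synchronizing: the host waits for outstanding device work before the call returns, so work issued to different streams can no longer overlap. If you want to see the effect, run your code with the visual profiler.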

It's recommended that you avoid functions like cudaMalloc, cudaMallocHost, cudaFree, and cudaFreeHost in loops that process data in a time-critical way.
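One way to follow that advice is to allocate one pinned host buffer and one device buffer per stream before the loop, reuse them inside it, and free everything afterwards. The following is only a minimal sketch, not the original poster's program: the kernel body, the buffer size (`count`), the launch configuration, and the helper name `runPipeline` are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define MAXSTREAM 10

// Hypothetical stand-in for the poster's kernelFunc: increments each element.
__global__ void kernelFunc(int *data, int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < count) data[idx] += 1;
}

// Runs n work items over MAXSTREAM streams; returns 0 on success.
int runPipeline(int n)
{
    const int count = 1 << 20;                 // ints per work item (assumed)
    const size_t bytes = count * sizeof(int);

    cudaStream_t stream[MAXSTREAM];
    int *h_toDevice[MAXSTREAM];
    int *d_toDevice[MAXSTREAM];

    // Allocate once, before the time-critical loop.
    for (int s = 0; s < MAXSTREAM; s++) {
        cudaStreamCreate(&stream[s]);
        cudaMallocHost((void **)&h_toDevice[s], bytes);
        cudaMalloc((void **)&d_toDevice[s], bytes);
    }

    for (int i = 0; i < n; i++) {
        int s = i % MAXSTREAM;
        // Wait for this stream only, so its buffers are safe to reuse;
        // the other streams keep running.
        cudaStreamSynchronize(stream[s]);
        memset(h_toDevice[s], 0, bytes);
        cudaMemcpyAsync(d_toDevice[s], h_toDevice[s], bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        kernelFunc<<<(count + 255) / 256, 256, 0, stream[s]>>>(d_toDevice[s], count);
    }

    // Wait for all outstanding work, then release everything after the loop.
    cudaError_t err = cudaDeviceSynchronize();
    for (int s = 0; s < MAXSTREAM; s++) {
        cudaFreeHost(h_toDevice[s]);
        cudaFree(d_toDevice[s]);
        cudaStreamDestroy(stream[s]);
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main()
{
    return runPipeline(100);
}
```

The per-stream cudaStreamSynchronize() is the simplest way to protect a buffer that is about to be reused. It makes the host wait on that one stream only, unlike the allocation and free calls that were inside the original loop.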

Thank you!