Device memory allocation implicit synchronization

I am having difficulty understanding implicit synchronization due to device memory allocation.

This blog post demonstrates that this code:

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int num_streams = 8;

    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }

    return 0;
}

will execute the kernel concurrently on each of the 8 streams even though there is a cudaMalloc call inside the loop, between the kernel launches. I have been able to reproduce the profile from the blog post, showing the kernels executing in parallel. However, the CUDA Programming Guide, in its section on implicit synchronization, states:

“Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

a device memory allocation,
[…]”
This statement from the programming guide seems to contradict the observed behavior. However, if the memory leak in this example is fixed by adding a cudaFree call as the last statement inside the for loop, the kernels execute serially, as the programming guide would suggest. How are the kernels able to execute in parallel in the original example even though a synchronous cudaMalloc is issued between the kernel launches? And why does freeing the memory after the kernel launches serialize the kernels?
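For reference, the modified loop body I am describing looks like this (the exact placement of cudaFree is my own; the serialized profile is what I observe with it in place):

```cuda
for (int i = 0; i < num_streams; i++) {
    cudaStreamCreate(&streams[i]);
    cudaMalloc(&data[i], N * sizeof(float));

    // launch one worker kernel per stream
    kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

    // launch a dummy kernel on the default stream
    kernel<<<1, 1>>>(0, 0);

    // Adding this free is what serializes the kernels in the profile;
    // cudaFree cannot return until preceding device work has completed,
    // so it acts as a synchronization point between iterations.
    cudaFree(data[i]);
}
```

With this version, each iteration's kernels finish before the next iteration's launches appear in the timeline, exactly as the programming guide's wording would predict.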

I have the same question.

I have the same question as well.