cudaMalloc affects the CPU launch latency of the cudaLaunchKernel that follows

I noticed that cudaMalloc affects the CPU-side latency of the kernel-launch API call that follows it.

Scenario 1: a separate cudaMalloc before each computation
In the second loop, the first cudaLaunchKernel API call took about 14 us on the CPU, and the rest took about 5 us.
The code looks like this:

// allocate the first matrices; n is the matrix size, n = 1024
cudaMalloc(&d_x1, sizeof(float) * n * n);
cudaMalloc(&d_y1, sizeof(float) * n * n);
cudaMemcpy(d_x1, x, sizeof(float) * n * n, cudaMemcpyHostToDevice);
cudaMemcpy(d_y1, y, sizeof(float) * n * n, cudaMemcpyHostToDevice);

// the first loop
for (int j = 0; j < KERNEL_NUMS; j++) {
    // in this loop the first kernel launch takes ~30 us, the rest ~5 us
    matMultCUDA<<<blocks_num, THREAD_NUM, 0, stream1>>>(d_x1, d_y1, d_z, n);
}
cudaStreamSynchronize(stream1);

// allocate the matrices again
cudaMalloc(&d_x2, sizeof(float) * n * n);
cudaMalloc(&d_y2, sizeof(float) * n * n);
cudaMemcpy(d_x2, x, sizeof(float) * n * n, cudaMemcpyHostToDevice);
cudaMemcpy(d_y2, y, sizeof(float) * n * n, cudaMemcpyHostToDevice);

// the second loop
for (int j = 0; j < KERNEL_NUMS; j++) {
    // the first kernel launch in this loop took ~14 us, the rest ~5 us
    matMultCUDA<<<blocks_num, THREAD_NUM, 0, stream1>>>(d_x2, d_y2, d_z, n);
}
cudaStreamSynchronize(stream1);

Scenario 2: allocate all device memory before execution
The first cudaLaunchKernel API call in the second loop took only about 5 us.

// allocate everything up front; n is the matrix size, n = 1024
cudaMalloc(&d_x1, sizeof(float) * n * n);
cudaMalloc(&d_y1, sizeof(float) * n * n);
cudaMemcpy(d_x1, x, sizeof(float) * n * n, cudaMemcpyHostToDevice);
cudaMemcpy(d_y1, y, sizeof(float) * n * n, cudaMemcpyHostToDevice);

cudaMalloc(&d_x2, sizeof(float) * n * n);
cudaMalloc(&d_y2, sizeof(float) * n * n);
cudaMemcpy(d_x2, x, sizeof(float) * n * n, cudaMemcpyHostToDevice);
cudaMemcpy(d_y2, y, sizeof(float) * n * n, cudaMemcpyHostToDevice);

// the first loop
for (int j = 0; j < KERNEL_NUMS; j++) {
    // in this loop the first kernel launch takes ~30 us, the rest ~5 us -- same as scenario 1
    matMultCUDA<<<blocks_num, THREAD_NUM, 0, stream1>>>(d_x1, d_y1, d_z, n);
}
cudaStreamSynchronize(stream1);

// the second loop
for (int j = 0; j < KERNEL_NUMS; j++) {
    // every launch in this loop is ~5 us -- different from scenario 1
    matMultCUDA<<<blocks_num, THREAD_NUM, 0, stream1>>>(d_x2, d_y2, d_z, n);
}
cudaStreamSynchronize(stream1);

I don't understand why the location of the cudaMalloc calls affects the CPU time of the kernel-launch API. Is there some implicit synchronization here?
I also observed that the launch latency after a cudaMalloc grows in proportion to the size of the allocation.
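In case it helps to reproduce the size-dependence observation outside the profiler, here is a minimal standalone sketch (my own illustration, not the original code; dummyKernel and the size list are placeholders) that puts a host-side timer around the first launch after allocations of increasing size:

```cuda
// Hedged sketch: time the first CPU-side kernel launch that follows a
// cudaMalloc of a given size. dummyKernel and the sizes are arbitrary.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *p) { if (p) p[0] = 0.0f; }

int main() {
    size_t sizes[] = {1u << 20, 1u << 24, 1u << 28};  // 1 MiB, 16 MiB, 256 MiB
    for (size_t bytes : sizes) {
        float *d = nullptr;
        cudaMalloc(&d, bytes);
        cudaDeviceSynchronize();                  // drain all prior GPU work

        auto t0 = std::chrono::steady_clock::now();
        dummyKernel<<<1, 1>>>(d);                 // first launch after the malloc
        auto t1 = std::chrono::steady_clock::now();

        cudaDeviceSynchronize();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("alloc %zu bytes -> first launch took %.1f us (CPU side)\n", bytes, us);
        cudaFree(d);
    }
    return 0;
}
```

If the CPU-side duration of the launch call grows with the allocation size here too, that would confirm the effect is independent of the Nsight Systems instrumentation.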

Environment: A100 (PCIe), CUDA 11.4
Profiling tool: Nsight Systems

The timing methodology used here is not clear. What happens if you insert a call to cudaDeviceSynchronize() just prior to the first loop? This should guarantee that all prior work on the GPU has completed before the loop starts launching kernels.
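Concretely, the suggested change might look like this (a sketch against the scenario 1 code from the question):

```cuda
// Sketch: drain all previously issued GPU work (allocations, copies,
// earlier kernels) before the loop whose launch latency is being measured.
cudaDeviceSynchronize();

// the first loop
for (int j = 0; j < KERNEL_NUMS; j++) {
    matMultCUDA<<<blocks_num, THREAD_NUM, 0, stream1>>>(d_x1, d_y1, d_z, n);
}
cudaStreamSynchronize(stream1);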

If a cudaDeviceSynchronize() inserted just before the first loop does not make a difference, your observation may be due to cold-start overhead.

The first time a particular code path is exercised, HW structures like caches and TLBs are not primed, and the same applies to initialization of some portion of software state. It is therefore a best practice to never time any code on the first pass through. Since some mechanisms take even longer to warm up and reach steady-state performance levels, conservative engineers might even ignore the first three passes or so.
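Applied to the code in the question, that practice might look like the following sketch (WARMUP_PASSES is an arbitrary choice):

```cuda
// Sketch: run a few untimed warm-up passes so caches, TLBs, and driver
// state are primed before any measurement is taken.
const int WARMUP_PASSES = 3;
for (int w = 0; w < WARMUP_PASSES; w++) {
    matMultCUDA<<<blocks_num, THREAD_NUM, 0, stream1>>>(d_x1, d_y1, d_z, n);
}
cudaStreamSynchronize(stream1);  // warm-up work finished; only start timing after this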

Thanks for your advice!
I inserted cudaDeviceSynchronize() before the first loop, and the result did not change. I agree that the first kernel launch in the first loop takes a long time due to a cold start.
But what puzzles me is why launching the same kernel in the second loop is affected by the location of the malloc: the first launch in the second loop takes a different time in the two scenarios. As I understand it, that launch should no longer be a cold start.