cudaDeviceSynchronize() waits for the GPU to finish before allowing the CPU thread to continue. This may involve a polling busy loop (pinning one CPU core at 100%) to achieve the lowest possible latency.
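If that busy-wait is a problem for you, the runtime lets you trade a little latency for a free CPU core. A minimal sketch (error checking omitted):

```cuda
#include <cuda_runtime.h>

int main() {
    // Must be set before the CUDA context is created (i.e. before the
    // first runtime call that touches the device).
    // cudaDeviceScheduleBlockingSync makes the waiting CPU thread block
    // on an OS primitive instead of spin-polling the GPU.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    // ... launch kernels ...

    cudaDeviceSynchronize();  // now blocks instead of polling at 100% CPU
    return 0;
}
```

The default, cudaDeviceScheduleAuto, usually picks spinning when there are more CPU cores than active contexts, which is why you see the 100% core.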
cudaFree() most likely synchronizes the CUDA context implicitly as well, because altering the device memory heap while kernels are still running would be unacceptably risky.
It’s likely that the time spent in these API calls is just waiting for the GPU to finish. When you say there is no problem at low iteration counts, you may also be hitting a limit on the kernel launch queue, which would cause blocking only at larger iteration counts. It’s impossible to tell without knowing the details of your kernel (e.g. how long one iteration takes to compute).
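You can see both effects with a small timing sketch. Kernel launches are asynchronous, so the launch loop returns almost immediately and the GPU time is paid in the next synchronizing call; but once the launch queue fills up (roughly on the order of a thousand pending launches, depending on driver and hardware), the launches themselves start to block. The kernel here is an arbitrary busy-work placeholder:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Placeholder kernel that just burns some GPU time per "iteration".
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)
            data[i] = data[i] * 0.9999f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < 1000; ++it)          // launches return immediately...
        busyKernel<<<(n + 255) / 256, 256>>>(d, n);
    auto t1 = std::chrono::steady_clock::now();

    cudaDeviceSynchronize();                   // ...the GPU time shows up here
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return (long long)std::chrono::duration_cast<
            std::chrono::milliseconds>(b - a).count();
    };
    printf("launch loop: %lld ms, sync: %lld ms\n", ms(t0, t1), ms(t1, t2));

    cudaFree(d);
    return 0;
}
```

If you raise the iteration count far enough, the "launch loop" time stops being near zero, which is the queue-limit blockage showing up inside the launch API itself.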
In general it is good advice to keep heap allocations out of tight compute loops. It’s better to allocate enough memory for your use case once and reuse it in the inner loops over and over. In performance-critical cases you may have to allocate several buffers in page-locked host memory and use CUDA streams to overlap memory transfers with compute.
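A minimal sketch of that pattern, with buffers allocated once up front and reused round-robin across streams (the kernel, sizes, and chunk count are placeholders, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

// Placeholder for your real per-iteration kernel.
__global__ void process(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int n = 1 << 20, chunks = 4;
    const size_t bytes = n * sizeof(float);

    // Allocate everything once, outside the compute loop.
    float *h[chunks], *d[chunks];
    cudaStream_t s[chunks];
    for (int i = 0; i < chunks; ++i) {
        cudaMallocHost(&h[i], bytes);   // page-locked host memory
        cudaMalloc(&d[i], bytes);
        cudaStreamCreate(&s[i]);
    }

    for (int it = 0; it < 1000; ++it) {
        int i = it % chunks;
        cudaStreamSynchronize(s[i]);    // wait until this slot is free again
        // ... fill h[i] with the next batch of input ...
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        process<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
        // ... consume results from h[i] after the next sync on s[i] ...
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < chunks; ++i) {
        cudaStreamDestroy(s[i]);
        cudaFree(d[i]);
        cudaFreeHost(h[i]);
    }
    return 0;
}
```

Because the copies use pinned memory and per-chunk streams, the transfer for one chunk can overlap the kernel of another, and no cudaMalloc/cudaFree (with its implicit synchronization) ever happens inside the loop.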