Performance problem when loading a multi-GPU system with independent simulations

No, that is not related.

It’s not related to synchronization. It comes from contention for a shared, host-side internal resource managed by the CUDA runtime, where access to the resource often requires acquiring a lock. Contention for that lock (and, more generally, simultaneous access to the shared resource) is what increases the duration of cudaMalloc/cudaFree calls. None of this is documented (the above link indicates that this aspect of CUDA runtime behavior is explicitly undocumented and subject to change), but you can find posts on these forums where people have provided evidence of lock contention in at least some of these cases.
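If the slowdown does come from concurrent cudaMalloc/cudaFree calls, one common way to sidestep it is to allocate once per simulation and reuse the buffers, so the runtime's allocator is rarely entered. Below is a minimal sketch of that idea, not a definitive implementation. The names `BufferPool`, `backend_alloc`, `backend_free` and `run_steps` are hypothetical, and the two backend functions are plain-host stand-ins for cudaMalloc/cudaFree (they use malloc/free and count calls) so the sketch builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// ASSUMPTION: stand-ins for cudaMalloc/cudaFree. In real code these two
// functions would wrap the CUDA runtime calls. The counters only exist to
// show how often the (potentially contended) allocator is entered.
static int g_backend_allocs = 0;
static int g_backend_frees = 0;

void* backend_alloc(std::size_t bytes) {
    ++g_backend_allocs;
    return std::malloc(bytes);
}

void backend_free(void* p) {
    ++g_backend_frees;
    std::free(p);
}

// Hypothetical per-simulation pool. Each host thread owns its own pool, so
// the pool itself needs no lock; only backend_alloc/backend_free touch the
// shared allocator.
class BufferPool {
public:
    ~BufferPool() {
        // Return every cached buffer to the backend when the pool dies.
        for (auto& entry : free_) {
            for (void* p : entry.second) backend_free(p);
        }
    }

    // Hand out a cached buffer of exactly this size if one is available,
    // otherwise fall back to the backend allocator.
    void* acquire(std::size_t bytes) {
        auto it = free_.find(bytes);
        if (it != free_.end() && !it->second.empty()) {
            void* p = it->second.back();
            it->second.pop_back();
            return p;
        }
        return backend_alloc(bytes);
    }

    // Keep the buffer for later reuse instead of freeing it immediately.
    void release(void* p, std::size_t bytes) {
        assert(p != nullptr);
        free_[bytes].push_back(p);
    }

private:
    std::map<std::size_t, std::vector<void*>> free_;  // size -> cached buffers
};

// Runs `steps` iterations that each need one scratch buffer and returns how
// many times the backend allocator was actually called.
int run_steps(int steps, std::size_t bytes) {
    const int before = g_backend_allocs;
    {
        BufferPool pool;
        for (int i = 0; i < steps; ++i) {
            void* scratch = pool.acquire(bytes);
            // ... kernels for this step would use `scratch` here ...
            pool.release(scratch, bytes);
        }
    }  // pool destroyed here, cached buffers go back to the backend
    return g_backend_allocs - before;
}
```

With this pattern, `run_steps(1000, 4096)` enters the backend allocator once rather than 1000 times, which is the point: fewer trips through the shared allocator means fewer chances to wait on whatever lock the runtime holds. Buffers still checked out when the pool is destroyed are not tracked in this sketch, so a real version would need to handle that.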