cudaMallocPitch is slow on A100

After the code was migrated from a 2080 Ti to an A100, cudaMallocPitch and cudaFree started showing runtime spikes: both calls spike to approx. 300 ms. This behavior was not observed on the 2080 Ti.

Driver version: 450.172.01
CUDA version: 11.0
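
For reference, here is a minimal sketch of how the spikes could be measured in isolation. It is not the original application code: the helper name time_alloc_free, the 4096 x 4096 float allocation size, and the iteration count of 100 are all assumptions chosen for illustration. It times each cudaMallocPitch / cudaFree pair with a host wall clock, after forcing context creation so that one-time initialization is not counted.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: times one cudaMallocPitch / cudaFree pair.
// Results are host wall-clock durations in milliseconds.
static cudaError_t time_alloc_free(size_t width_bytes, size_t height,
                                   double* alloc_ms, double* free_ms) {
    using clock = std::chrono::steady_clock;
    void* ptr = nullptr;
    size_t pitch = 0;

    auto t0 = clock::now();
    cudaError_t err = cudaMallocPitch(&ptr, &pitch, width_bytes, height);
    auto t1 = clock::now();
    if (err != cudaSuccess) return err;

    err = cudaFree(ptr);
    auto t2 = clock::now();

    *alloc_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    *free_ms  = std::chrono::duration<double, std::milli>(t2 - t1).count();
    return err;
}

int main() {
    // Force context creation up front so it does not inflate the first timing.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Assumed allocation size; replace with the sizes the real code uses.
    const size_t width_bytes = 4096 * sizeof(float);
    const size_t height = 4096;

    for (int i = 0; i < 100; ++i) {
        double alloc_ms = 0.0, free_ms = 0.0;
        err = time_alloc_free(width_bytes, height, &alloc_ms, &free_ms);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "iteration %d failed: %s\n", i,
                         cudaGetErrorString(err));
            return 1;
        }
        std::printf("iter %3d  cudaMallocPitch %.3f ms  cudaFree %.3f ms\n",
                    i, alloc_ms, free_ms);
    }
    return 0;
}
```

Built with nvcc and run on both cards, the per-iteration output should make it visible whether the 300 ms spikes occur on every call or only intermittently.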