On a Jetson AGX Xavier, I found that cudaMemcpy() takes about 180 ms when I call:
GPU_CHECK (cudaMemcpy(host_ptr, device_ptr, sizeof(int), cudaMemcpyDeviceToHost));
What could cause this abnormal behavior?
I intentionally repeated this call 3 times: the first call takes about 180 ms, while the subsequent ones take only a few microseconds (<< 1 ms).
Why is only the first call so slow?
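For reference, here is a minimal sketch of how each call can be timed separately. It is not my exact code: the GPU_CHECK definition, the time_copies helper, the sync_before flag, and the buffer setup are hypothetical stand-ins, and I am assuming std::chrono for host-side timing. Since cudaMemcpy() blocks the host and is ordered after work already queued on the default stream, the optional cudaDeviceSynchronize() before the timed loop is meant to show whether the first call is actually waiting on earlier work rather than on the copy itself.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the GPU_CHECK macro used in the post.
#define GPU_CHECK(call)                                                  \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__);  \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Times `repeats` consecutive device-to-host copies of a single int.
// Returns the host-side wall-clock time of each call in milliseconds.
// If `sync_before` is true, the device is synchronized first so that
// earlier asynchronous work is not attributed to the first copy.
std::vector<double> time_copies(int repeats, bool sync_before) {
    int host_value = 0;
    int* device_ptr = nullptr;
    GPU_CHECK(cudaMalloc(&device_ptr, sizeof(int)));
    GPU_CHECK(cudaMemset(device_ptr, 0, sizeof(int)));
    if (sync_before) {
        GPU_CHECK(cudaDeviceSynchronize());
    }

    std::vector<double> elapsed_ms;
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        GPU_CHECK(cudaMemcpy(&host_value, device_ptr, sizeof(int),
                             cudaMemcpyDeviceToHost));
        const auto stop = std::chrono::steady_clock::now();
        elapsed_ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }

    GPU_CHECK(cudaFree(device_ptr));
    return elapsed_ms;
}

int main() {
    // Three timed calls, as in the post; flip the flag to compare.
    const std::vector<double> t = time_copies(3, /*sync_before=*/true);
    for (std::size_t i = 0; i < t.size(); ++i) {
        std::printf("call %zu: %.3f ms\n", i + 1, t[i]);
    }
    return 0;
}
```

In a real application, the cudaDeviceSynchronize() would go right after whatever kernels or asynchronous copies precede the first cudaMemcpy(). If the first timed call then drops to the same range as the later ones, the 180 ms was probably spent waiting, not copying.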