Can cudaMemcpy take more than 100 milliseconds to copy just a few bytes of data?

On a Jetson AGX Xavier, I found that cudaMemcpy() takes about 180 ms when I call

GPU_CHECK (cudaMemcpy(host_ptr, device_ptr, sizeof(int), cudaMemcpyDeviceToHost));

What could cause this abnormal behavior?

I intentionally repeated this call 3 times: the first call takes about 180 ms, while the subsequent ones take only a few microseconds (far less than 1 ms).

What is the reason for this phenomenon?

context->enqueue() is an asynchronous operation.
The subsequent cudaMemcpy has to wait for the inference to finish. Maybe this causes the first call of cudaMemcpy to be so slow. Am I right?


Add a cudaDeviceSynchronize() before the call to cudaMemcpy. This forces any pending asynchronous work to finish. Then start the timing and make the cudaMemcpy call, so that the measurement covers only the copy itself.
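To illustrate the effect, here is a minimal sketch, not the original poster's code. It assumes a hypothetical busy_kernel that spins on the GPU as a stand-in for the asynchronous inference, plus a hypothetical CHECK macro in place of GPU_CHECK. It times the same 4-byte cudaMemcpy twice: once right after the kernel launch, and once after a cudaDeviceSynchronize().

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-check macro, standing in for the GPU_CHECK in the question.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s\n",                  \
                         cudaGetErrorString(err_));                   \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the asynchronous inference: spin for `cycles`
// GPU clock ticks, then write a result.
__global__ void busy_kernel(int *out, long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
    *out = 42;
}

// Launch the kernel, optionally synchronize, then time a 4-byte copy.
// Returns the wall-clock duration of the cudaMemcpy call alone, in ms.
double run_case(bool sync_first, int *value) {
    int *dev = nullptr;
    CHECK(cudaMalloc(&dev, sizeof(int)));

    // The launch returns to the host immediately; the GPU keeps working.
    busy_kernel<<<1, 1>>>(dev, 100000000LL);  // cycle count is an arbitrary assumption
    CHECK(cudaGetLastError());

    if (sync_first) {
        CHECK(cudaDeviceSynchronize());  // wait here, outside the timed region
    }

    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpy(value, dev, sizeof(int), cudaMemcpyDeviceToHost));
    auto t1 = std::chrono::steady_clock::now();

    CHECK(cudaFree(dev));
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    int v = 0;
    double unsynced = run_case(false, &v);
    double synced = run_case(true, &v);
    std::printf("memcpy without prior sync:          %.3f ms\n", unsynced);
    std::printf("memcpy after cudaDeviceSynchronize: %.3f ms\n", synced);
    return 0;
}
```

Built with nvcc, the first figure should include roughly the kernel's run time, while the second should reflect only the copy. The exact numbers depend on the GPU clock and are not something this sketch can predict.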

Your suggestion works. Thanks a lot.