context->enqueue() is an asynchronous operation.
The subsequent cudaMemcpy has to wait for the inference to finish. Maybe this is what causes the first call to cudaMemcpy to be so slow. Am I right?
Add a cudaDeviceSynchronize() before the call to cudaMemcpy. This blocks until all previously issued asynchronous work has finished. Then start the timer and call cudaMemcpy, so that the measurement covers only the copy.
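
A minimal sketch of that measurement, assuming a plain CUDA program with no TensorRT engine available: the hypothetical busyKernel only stands in for the asynchronous work that context->enqueue() launches, and timeCopyMs is an illustrative helper name, not part of any library.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <vector>

// Hypothetical stand-in for the asynchronous inference started by context->enqueue().
__global__ void busyKernel(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Returns the wall-clock time of the device-to-host cudaMemcpy in milliseconds,
// or -1.0 if a CUDA call fails (for example, when no GPU is present).
// With syncFirst == false the measurement also includes waiting for the kernel.
double timeCopyMs(bool syncFirst) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);

    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0;
    if (cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return -1.0;
    }

    // The launch returns to the host immediately; the GPU keeps working.
    busyKernel<<<(n + 255) / 256, 256>>>(dev, n, 10000);

    // Drain all pending GPU work before the timer starts.
    if (syncFirst && cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(dev);
        return -1.0;
    }

    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::steady_clock::now();

    cudaFree(dev);
    if (err != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Comparing timeCopyMs(false) with timeCopyMs(true) on the same machine should show how much of the "slow" cudaMemcpy is really time spent waiting for the earlier asynchronous work.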