Issue with running CPU and GPU code Asynchronously

Hello fellow developers. I have recently started learning how to use CUDA for general-purpose programming, and I came across something that puzzles me.

Suppose the following scenario (without loss of generality):

  • initially I launch a kernel that runs for 1000 ms.
  • immediately after the kernel launch, my next sequence of host instructions runs for 1000 ms. A big loop, let’s say; the only point of the loop is to keep the CPU busy for 1000 ms.
  • after the loop I copy the memory from the device back to host RAM.
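The steps above can be sketched roughly as follows. This is a minimal sketch, not my actual code; the kernel body, loop body, and buffer names are placeholders, and the iteration counts would need tuning to actually hit ~1000 ms on a given machine:

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Placeholder kernel that spins long enough to take roughly 1000 ms
// (the iteration count `iters` is an assumption and needs tuning).
__global__ void busyKernel(float *out, long long iters) {
    float v = 0.0f;
    for (long long i = 0; i < iters; ++i)
        v += 1e-7f * i;
    out[threadIdx.x] = v;
}

int main() {
    float *d_buf, h_buf[256];
    cudaMalloc(&d_buf, sizeof(h_buf));

    auto t0 = std::chrono::steady_clock::now();

    // 1) Kernel launch: the call returns immediately,
    //    the GPU keeps working in the background.
    busyKernel<<<1, 256>>>(d_buf, 1LL << 26);

    // 2) Keep the CPU busy for ~1000 ms with a host-side loop.
    volatile double sink = 0.0;
    auto busyUntil = t0 + std::chrono::milliseconds(1000);
    while (std::chrono::steady_clock::now() < busyUntil)
        sink += 1.0;

    // 3) Copy the results back; cudaMemcpy is a synchronizing call,
    //    so it blocks here until the kernel has finished.
    cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    printf("total wall time: %lld ms\n", (long long)ms);

    cudaFree(d_buf);
    return 0;
}
```

If the launch really overlaps with the host loop, the total wall time should be close to max(kernel time, loop time) plus the copy, i.e. about 1000 ms rather than 2000 ms.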

Given the above scenario, am I right to assume that the total CPU time should be 1000 ms (+/- epsilon)?

Since the kernel launch is asynchronous, I expected the total CPU time to be about 1000 ms, but all the tests that I performed show the opposite. This seems counter-intuitive to me; can someone please clarify it?
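One way to narrow down where the time actually goes is to time the launch call itself with a host timer and the kernel separately with CUDA events. A self-contained sketch (the kernel here is a hypothetical stand-in for the real one):

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void dummyKernel(float *out, long long iters) {
    float v = 0.0f;
    for (long long i = 0; i < iters; ++i)
        v += 1e-7f * i;
    out[0] = v;
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    auto h0 = std::chrono::steady_clock::now();
    cudaEventRecord(start);
    dummyKernel<<<1, 1>>>(d_out, 1LL << 24);
    cudaEventRecord(stop);
    // The launch call has returned at this point; measure how long it took.
    auto launchUs = std::chrono::duration_cast<std::chrono::microseconds>(
                        std::chrono::steady_clock::now() - h0).count();

    cudaEventSynchronize(stop);          // wait for the kernel to finish
    float kernelMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, stop);

    // If the launch is truly asynchronous, launchUs should stay tiny
    // (microseconds) even when kernelMs is large.
    printf("launch call took %lld us, kernel ran %.1f ms\n",
           (long long)launchUs, kernelMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

If the launch time reported here is small but the overall program still takes ~2000 ms, the extra time is being spent in a later synchronizing call (such as the device-to-host copy), not in the launch itself.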

P.S. I don’t make any calls to cudaThreadSynchronize().