Hello fellow developers. I recently started learning how to use CUDA for general-purpose programming, and I came across something that puzzles me.
Suppose the following scenario (without loss of generality):
- initially I launch a kernel that runs for 1000 ms.
- immediately after the kernel call, the next host instructions run for 1000 ms; a big loop, let’s say, whose only purpose is to keep the CPU busy for 1000 ms.
- after the loop, I copy the result from device memory back to host RAM (see the sketch right after this list).
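In case it helps, here is a minimal sketch of the kind of test I am running. The spin kernel, the spin-cycle count, and the exact durations are just placeholders standing in for my real code:

```
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Placeholder kernel: spins on clock64() to stand in for ~1000 ms of device work.
__global__ void longKernel(int *out, long long spinCycles)
{
    long long start = clock64();
    while (clock64() - start < spinCycles) { /* keep the GPU busy */ }
    out[0] = 42;
}

int main()
{
    int *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));

    auto t0 = std::chrono::steady_clock::now();

    // 1) Kernel launch: asynchronous with respect to the host.
    longKernel<<<1, 1>>>(d_out, 1500000000LL); // cycle count tuned so the kernel runs ~1000 ms

    // 2) Host busy loop for ~1000 ms (no CUDA calls in here).
    volatile double sink = 0.0;
    auto busyStart = std::chrono::steady_clock::now();
    while (std::chrono::steady_clock::now() - busyStart < std::chrono::milliseconds(1000))
        sink = sink + 1.0;

    // 3) Device-to-host copy: this call blocks until the kernel has finished.
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    auto t1 = std::chrono::steady_clock::now();
    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    printf("total host time: %lld ms (result = %d)\n", ms, h_out);

    cudaFree(d_out);
    return 0;
}
```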
Given the above scenario, am I right to assume that the total CPU time should be about 1000 ms (+/- epsilon)?
That is what I expected: since the kernel launch is asynchronous, the 1000 ms busy loop should overlap with the kernel’s execution, so the total CPU time should be roughly 1000 ms. But every test I performed shows the opposite. This seems counter-intuitive to me; can someone please clarify it?
P.S. I don’t make any calls to cudaThreadSynchronize().