I am timing a CUDA kernel and I want to compare it to a CPU implementation using wall-clock time. I use the "time" command on Linux. When I run my CPU implementation, the system time is around 0.15 sec for 30 sec of real time (not enough overhead to worry about). However, when I run my CUDA code, the system time is 1.5 sec for a real time of 7.5 sec. This makes a big difference: if the system time during the CUDA run were also 0.15 sec, the speedup would be 30/6.15 ≈ 5 instead of 30/7.5 = 4. I don't run anything else on the machine while doing this timing, so no other process should be causing the increased system time during the kernel execution.
Does anybody know the reason for this increased system overhead during kernel execution? I suspect it is related to GPU usage (driver calls, etc.), in which case I should factor it into my timing, but I am not completely sure. If I could disregard this time, I would use only the user time, but I don't want to do that if the extra system time really is due to GPU usage.
Is there a way to see specifically which system calls or kernel-side tasks account for that 1.5 sec during the kernel execution?