I’m working on an 8800 GTX with CUDA 2.0. I profiled my implementation and found some results that look weird to me.
I called three kernels in series (let’s call them A, B, and C).
The CUDA profiler gave the following result:
A: GPU time 1194.59, CPU time 1212.81
B: GPU time 354.432, CPU time 70474.5 (!!!)
C: GPU time 138.464, CPU time 154.985
I wondered why B’s CPU time was so large, so I ran the kernels without B (i.e., only A and C), and the CUDA profiler gave:
A: GPU time 354.784, CPU time 1221.18
C: GPU time 131.584, CPU time 147.936
I do not think this is accurate, because kernel A did not change at all, yet its GPU time changed a lot. I also tried the previous version of the profiler, but it gave the same result.
Do you have any idea about it?
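A cross-check that does not rely on the profiler would be to time each launch with CUDA events. Below is a minimal sketch of that idea, not my actual code: `dummyKernel`, `timeLaunchMs`, the problem size, and the iteration counts are all hypothetical stand-ins for kernels A, B, and C. It uses only the runtime event API (`cudaEventCreate`, `cudaEventRecord`, `cudaEventSynchronize`, `cudaEventElapsedTime`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real kernels A, B and C are not shown here.
__global__ void dummyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;   // arbitrary work so the launch takes measurable time
        data[i] = v;
    }
}

// Times a single launch with CUDA events and returns the elapsed time in milliseconds.
float timeLaunchMs(float *d_data, int n, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);
    dummyKernel<<<blocks, threads>>>(d_data, n, iters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // block the host until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;            // assumed problem size, for illustration only
    float *d_data = 0;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Untimed warm-up launch, so first-launch setup cost is not charged to "A".
    timeLaunchMs(d_data, n, 1);

    // Three launches in series, standing in for A, B and C.
    printf("A: %.3f ms\n", timeLaunchMs(d_data, n, 300));
    printf("B: %.3f ms\n", timeLaunchMs(d_data, n, 100));
    printf("C: %.3f ms\n", timeLaunchMs(d_data, n, 30));

    cudaFree(d_data);
    return 0;
}
```

If the event-based time for A stays the same with and without B in the sequence, that would suggest the difference lies in how the profiler attributes time rather than in the kernel itself.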
Also, I wonder what causes the CPU overhead. Does it depend on the number of threads or the grid size? Do CUDA functions like cudaMalloc or cudaMemset affect the CPU overhead?
Any ideas or information will be appreciated.