In the width plot of my app, a significant portion of the timeline is blank, marked as “idle time”, between the kernels, which are shown as colored rectangles. The problem is that, according to the summary table of the same run, both the total CPU time and the total GPU time are pretty close to the entire execution time. This makes me wonder what the blanks might be:
- Kernel launch overhead: setting up the context, transferring the launch configuration and parameters, etc. But according to the summary table, overhead accounts for only a small part of the blanks, and some of the blanks are several times wider than most of my kernels.
- Implicit memory copies, such as DeviceToDevice cudaMemcpy calls, that are not included in the kernel list. After I replaced them with a simple DeviceToDevice copy kernel, these kernels appeared in some of the previously blank places. Still, they account for only a small portion of all the blanks.
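For anyone who wants to try the same experiment, here is a minimal sketch of what such a DeviceToDevice copy kernel might look like. It is not my exact code: the name `copyKernel`, the `float` element type, and the launch configuration are all placeholders chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice).
// Because it is an ordinary kernel, it should show up in the profiler's kernel list.
__global__ void copyKernel(float *dst, const float *src, size_t n)
{
    // Grid-stride loop: any launch configuration covers all n elements.
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

int main()
{
    const size_t n = 1 << 20;          // illustrative size
    const size_t bytes = n * sizeof(float);

    std::vector<float> host(n), check(n, 0.0f);
    for (size_t i = 0; i < n; ++i) host[i] = (float)i;

    float *src = NULL, *dst = NULL;
    cudaMalloc((void **)&src, bytes);
    cudaMalloc((void **)&dst, bytes);
    cudaMemcpy(src, &host[0], bytes, cudaMemcpyHostToDevice);

    // Replaces: cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    copyKernel<<<256, 256>>>(dst, src, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy the result back and compare it with the original data.
    cudaMemcpy(&check[0], dst, bytes, cudaMemcpyDeviceToHost);
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (check[i] != host[i]) ++mismatches;
    printf("mismatches: %lu\n", (unsigned long)mismatches);

    cudaFree(src);
    cudaFree(dst);
    return mismatches == 0 ? 0 : 1;
}
```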
Are there any other explanations for those blanks?
Update: I replaced cudaMemset with my own memset kernel, and most of the blanks are occupied now! The app runs a little faster as well. Funny that a 3-line memset kernel can outperform the official function. Is it because of some additional overlapping? But I remember that all DeviceToDevice memory operations are asynchronous…
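A rough sketch of the kind of memset kernel I mean is below. It is simplified and makes assumptions that cudaMemset does not: it fills 32-bit words rather than bytes, and the names `memsetKernel`, `buf`, and the launch configuration are illustrative only.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical replacement for cudaMemset. Assumption: the buffer is filled
// one 32-bit word at a time, whereas cudaMemset sets individual bytes.
__global__ void memsetKernel(unsigned int *buf, unsigned int value, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        buf[i] = value;
}

int main()
{
    const size_t n = 1 << 20;          // number of 32-bit words, illustrative
    const size_t bytes = n * sizeof(unsigned int);

    unsigned int *buf = NULL;
    cudaMalloc((void **)&buf, bytes);

    // Replaces: cudaMemset(buf, 0, bytes);
    memsetKernel<<<256, 256>>>(buf, 0u, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Start from a nonzero host buffer so the check below is meaningful.
    std::vector<unsigned int> check(n, 0xFFFFFFFFu);
    cudaMemcpy(&check[0], buf, bytes, cudaMemcpyDeviceToHost);
    size_t nonzero = 0;
    for (size_t i = 0; i < n; ++i)
        if (check[i] != 0u) ++nonzero;
    printf("nonzero words: %lu\n", (unsigned long)nonzero);

    cudaFree(buf);
    return nonzero == 0 ? 0 : 1;
}
```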