Confusion about Accelerator Kernel Timing data?

JMa1 · January 15, 2013, 5:13am

Hi Mat & All,
How should I interpret the Accelerator Kernel Timing data? I thought:
total = init + kernels + data

But the following output from my run is very confusing: the total is much larger than the sum of those three. Do you know what the “other” time beyond those three items is? Where does that time go?

Thanks,
Jingsen

Accelerator Kernel Timing data
…
    89: region entered 13100 times
        time(us): total=8,203,329 init=2,125 region=8,201,204
                  kernels=430,774 data=769,253
        w/o init: total=8,201,204 max=10,256 min=244 avg=626
        90: kernel launched 13100 times
            grid: [8] block: [128]
            time(us): total=324,067 max=68 min=23 avg=24
        91: kernel launched 13100 times
            grid: [1] block: [256]
            time(us): total=106,707 max=68 min=8 avg=8

MatColgrove · January 15, 2013, 5:04pm

Hi Jingsen,

The “total” time for an accelerator region is measured from the host, while the other timers (init, kernels, data) are taken from the device driver. Hence, the difference between the total and the sum of those three timers is time spent on the host.

Exactly where the host time is being spent is something I’m currently investigating. A delta this large occurs only in some regions, not others. From what I can tell, the host code seems to get blocked somewhere, either in our runtime libraries or in the CUDA device driver. The time spent blocked in each iteration isn’t that large, but with a large number of iterations it adds up. I am looking into it and hope to identify the problem. After that, we can determine whether it’s a performance bug, or at least explain the behaviour so it can be avoided. I’ll post more once I know more, but so far I’ve been unable to determine the exact cause.
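To put rough numbers on this, here is a small Python sketch that works out the host-side delta for region 89 from the figures in the posted profile. The variable and function names are only illustrative and are not part of any profiler API. It assumes the host-measured total and the driver-reported timers can be subtracted directly.

```python
# Figures copied from the region-89 profile in the question (microseconds).
REGION_ENTRIES = 13_100   # "region entered 13100 times"
TOTAL_US = 8_203_329      # measured from the host
INIT_US = 2_125           # reported by the device driver
KERNELS_US = 430_774      # reported by the device driver
DATA_US = 769_253         # reported by the device driver


def host_side_delta(total_us, init_us, kernels_us, data_us):
    """Return the part of the host-measured total not covered by device timers."""
    return total_us - (init_us + kernels_us + data_us)


delta_us = host_side_delta(TOTAL_US, INIT_US, KERNELS_US, DATA_US)
per_entry_us = delta_us / REGION_ENTRIES

print(f"device-side sum : {INIT_US + KERNELS_US + DATA_US:,} us")
print(f"host-side delta : {delta_us:,} us")
print(f"delta per entry : {per_entry_us:.1f} us")
```

For this profile the unaccounted host time comes to 7,001,177 us overall, which is only about 534 us per region entry. That is small for a single entry but dominates the total once it is multiplied by 13,100 entries.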

Mat