Host optimization option and kernel invocation latency

Hi NVIDIA,

I found something weird: if I set the host optimization level to 3, the kernel invocation latency becomes much larger.

For example, before host optimization an HtoD copy takes 25 us, but after host optimization it goes up to 70 us.

Is this normal? Or am I missing something?


It is not clear what you are measuring and how. An HtoD copy is not a kernel invocation. The basic work of kicking off work on the GPU is done inside the CUDA driver and runtime, binary executable components which are independent of the optimizations applied to the host code in your application.

Without reproducer code, I think you might be looking at artifacts of CPU/GPU interaction, which is more likely if you are on a Windows platform using a WDDM driver. In the case of copies there might also be cold-start effects. This is all speculation, as there is no information about the platform or the code running on it.

Thank you for the fast reply, njuffa!!!

I think I explained it wrong …

As you know, an HtoD memcpy triggers a runtime API call; the latency of that call is what I mean.

I tried to upload a picture of the Visual Profiler result but … I don’t know how … ;; (-_-)a …
And if I use the host optimize option, not only the HtoD copy but the runtime API latency of every kernel invocation increases as well.

And I’m not on a Windows platform; I’m using a TX2 with remote visual profiling from a host Linux PC.
Is this normal? Or am I doing something wrong?


Have you checked whether there is a clear correlation by repeating the measurements? Compile at -O2 and profile three times. Compile at -O3 and profile three times. Then go back to -O2 and profile three times.

Other than changing the optimization level of the build, have any other changes been made to the system (hardware or software)? Is it possible that there are other applications running on the system while you profile, either run by you, or by another user?

My working hypothesis here is that what you observe is either an artifact created by sub-optimal measuring methodology, or a distortion of results due to outside factors. As I stated earlier, there is no causal relationship I can envision between the optimization level of the host portion of your code and the speed at which the CUDA software components do their work. You are measuring times in the microsecond range, so some fluctuation (“noise”) and random artifacts are likely, which is why I suggest repeating the measurements.

I have never used a TX2. There may be specific caveats with respect to profiling on such a system that I am not aware of, so you might also consider asking in the dedicated TX2 forum: