cudaStreamSynchronize and its relation to WDDM

Hi everyone,

I’m learning about the effect of WDDM on kernel launches. However, reading explanations here and there without concrete evidence, such as screenshots from Nsight Performance Analysis, makes it hard for me to really understand. I would therefore appreciate it if you could point the effect out directly in the screenshots attached below. Here are some pages I found quite informative about WDDM’s effects:
https://stackoverflow.com/questions/12196044/time-between-kernel-launch-and-kernel-execution
https://devtalk.nvidia.com/default/topic/548639/is-wddm-causing-this-
https://devtalk.nvidia.com/default/topic/525137/?comment=3718330

My current problem is that my program shows idle periods between kernel launches (see the Nonsync1 and Nonsync2 screenshots). The nonsync version runs a loop of various kernel launches without any cudaStreamSynchronize call. However, if I switch to a sync version that calls cudaStreamSynchronize once per loop iteration, those gaps disappear (see the Sync1 and Sync2 screenshots). Here are my questions on this topic:

  1. In a program that contains only kernel/memcpy/memset calls, are repeated gaps in the execution timeline always WDDM's fault?
  2. In my case, is this behavior also caused by WDDM? If yes, can you point out where it shows up in my screenshots? If it doesn't appear there, could you please tell me where to capture its footprint?
  3. What is your guess/hypothesis about why adding cudaStreamSynchronize eliminates those gaps in the second version of my program?
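
For concreteness, the two variants described above can be sketched roughly as follows. This is only a minimal illustration: the kernels (kernelA, kernelB), the buffer size, the launch configuration, and the iteration count are placeholders and not the actual program. The only difference between the two versions is the flag that controls the per-iteration cudaStreamSynchronize call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels standing in for the real workload (hypothetical).
__global__ void kernelA(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void kernelB(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;          // assumed buffer size
    const int iterations = 100;     // assumed loop count
    // false: "nonsync" version, true: "sync" version
    const bool syncEachIteration = false;

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const dim3 block(256);
    const dim3 grid((n + block.x - 1) / block.x);

    for (int it = 0; it < iterations; ++it) {
        // Several asynchronous launches queued on the same stream.
        kernelA<<<grid, block, 0, stream>>>(d_data, n);
        kernelB<<<grid, block, 0, stream>>>(d_data, n);

        // Sync version only: block the host once per iteration
        // until all work queued on the stream has finished.
        if (syncEachIteration) {
            cudaStreamSynchronize(stream);
        }
    }

    // Wait for any remaining work before tearing down.
    cudaStreamSynchronize(stream);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```

Profiling this once with syncEachIteration set to false and once with it set to true should give two timelines that can be compared in the same way as the Nonsync and Sync screenshots.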

Thanks so much for your help!