In my experience, sometimes this happens as a result of WDDM command batching.
One solution is to use a GPU in TCC mode.
Another possible solution in that case is to force a flush of the WDDM command queue at the necessary point in your program. The method to do this has varied over time, but what I believe has worked in my case is recording a CUDA event into the same stream as the kernel, right after the kernel launch, and then strategically placing a cudaEventQuery call on that event inside the host loop that updates the progress bar.
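A minimal sketch of that idea, not the exact code from the Stack Overflow answer. The kernel name `slow_kernel`, the `CHECK` macro, the step count and the delay are all hypothetical, and reporting progress through mapped pinned host memory is an assumption about how the progress bar gets its value:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical long-running kernel: one thread performs `steps` delays and
// publishes the number of finished steps to mapped host memory.
__global__ void slow_kernel(volatile int *progress, int steps, long long delay)
{
    for (int i = 0; i < steps; i++) {
        long long start = clock64();
        while (clock64() - start < delay) { }   // placeholder for real work
        *progress = i + 1;
        __threadfence_system();                 // make the write visible to the host
    }
}

int main()
{
    const int steps = 20;                 // assumed number of progress steps
    const long long delay = 100000000LL;  // assumed busy-wait length in clock ticks

    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Pinned host memory that the kernel can write to directly.
    int *h_progress = nullptr;
    int *d_progress = nullptr;
    CHECK(cudaHostAlloc((void **)&h_progress, sizeof(int), cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&d_progress, h_progress, 0));
    *h_progress = 0;

    cudaEvent_t stop;
    CHECK(cudaEventCreate(&stop));

    slow_kernel<<<1, 1>>>(d_progress, steps, delay);
    CHECK(cudaGetLastError());
    // Record an event in the same (default) stream, right after the launch.
    CHECK(cudaEventRecord(stop));

    volatile int *progress = h_progress;
    int last = 0;
    for (;;) {
        // Under WDDM this query is what nudges the batched launch out to the GPU.
        cudaError_t status = cudaEventQuery(stop);
        int value = *progress;
        if (value != last) {
            printf("progress: %d / %d\n", value, steps);
            last = value;
        }
        if (status == cudaSuccess) break;        // kernel has finished
        if (status != cudaErrorNotReady) CHECK(status);
    }

    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFreeHost(h_progress));
    return 0;
}
```

Without the cudaEventQuery call the polling loop can spin forever under WDDM, because the launch may still be sitting in the driver's command queue. In TCC mode or on Linux the query should be harmless: it just returns cudaErrorNotReady until the kernel is done.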
Hi @Robert_Crovella, this issue came back to me: the workaround stopped working about a year after our exchange above.
I just recompiled your sample code from Stack Overflow on a Windows machine, and it hung as well. I just want to check with you and see if you are still able to run your sample code. If it is supposed to work, I will look into other changes I made related to this progress bar.
As-is, the code really did not work under Windows WDDM, and I think that is discussed in the comments underneath the answer. According to my testing on an RTX 2070 (WDDM), CUDA 10.1, Windows 10, driver 432.00, if I add a cudaEventQuery(stop) where indicated in the now-updated answer and run that code, it seems to run fine for me.
Yes, your updated example compiles and runs as expected. In fact, the commit I made back in 2017 also works. For some reason, in 2018 I was convinced that some new driver had broken this commit, so I disabled that patch. Now, after reinstating the patch, the progress bar works again. I tested on two Windows machines, one with the latest driver, and had no problem.
On Linux, the progress bar works well without this patch. Is the patch only needed for the Windows WDDM driver, or do I also need to enable it on Mac OS?