Using pinned memory to pass simple data fails on Windows

I’d like to pass a kernel's progress (as an integer) to the host so that the host can print a progress bar while the kernel is still running.

I used pinned memory, as suggested by others. It works really well on Linux, but fails every time on Windows.

Here is my source code: the host memory variable is *progress; it is mapped to the GPU memory gprogress, both declared as “volatile int”.

Then the kernel updates the value that gprogress points to.

I could then use the “*progress” value on the host side to create a progress bar.
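For readers without the original listing, here is a minimal sketch of this kind of mapped pinned memory setup. It is not the code from the post: the kernel name `work`, the block count, and the one-tick-per-block scheme are assumptions made for illustration. It only shows the allocation, the device pointer lookup, and the kernel-side update, and it reads the counter after the kernel has finished rather than polling it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 of each block adds one tick to the counter.
__global__ void work(volatile int *gprogress)
{
    // ... the real per-block work would go here ...
    __syncthreads();
    if (threadIdx.x == 0) {
        atomicAdd((int *)gprogress, 1);  // atomic among GPU threads only
        __threadfence_system();          // order the write with respect to the host
    }
}

int main()
{
    const int blocks = 64;               // assumed launch size
    volatile int *progress  = nullptr;   // host view of the counter
    volatile int *gprogress = nullptr;   // device view of the same memory

    // Allow mapped (zero-copy) host allocations, then allocate one int.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaHostAlloc((void **)&progress, sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed\n");
        return 1;
    }
    *progress = 0;
    cudaHostGetDevicePointer((void **)&gprogress, (void *)progress, 0);

    work<<<blocks, 128>>>(gprogress);
    cudaDeviceSynchronize();             // wait for the kernel to finish

    printf("final progress = %d of %d\n", *progress, blocks);
    cudaFreeHost((void *)progress);
    return 0;
}
```

The host only reads the counter here, so the fact that atomics on mapped memory are not atomic from the host's point of view does not matter for this use.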

On a Windows 10 machine (CUDA 8, 1050Ti+730GT), the “*progress” value stays at 0, so my loop above never ends. It is updated fine on Ubuntu Linux.

I even added “__threadfence_system()” inside the kernel, but it did not help.

Can someone take a quick look and let me know if there is anything wrong? Thanks.

In my experience, sometimes this happens as a result of WDDM command batching.

One solution is to use a GPU in TCC mode.
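Whether a given GPU is running under the TCC driver can be checked from the runtime API. The sketch below relies on the `tccDriver` field of `cudaDeviceProp`; the output wording is my own. Keep in mind that TCC mode is not available on every GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;
        // tccDriver is 1 only when the device uses the TCC driver;
        // it is 0 under WDDM on Windows and on other platforms.
        printf("device %d (%s): %s\n", dev, prop.name,
               prop.tccDriver ? "TCC driver" : "not TCC");
    }
    return 0;
}
```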

Another possible solution in that case is to force a flush of the WDDM command queue at the necessary point in your program. The method to do this has varied over time, but in my case I believe what worked was recording a CUDA event into the same stream as the kernel, right after the kernel launch, and then strategically placing a cudaEventQuery call on that event in the host loop that updates the progress bar.
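As a rough illustration of that event-based approach, here is a self-contained sketch. The kernel `busy_kernel`, its spin count, and the block count are hypothetical stand-ins, and this is not the Stack Overflow sample discussed in this thread. The host loop calls cudaEventQuery on every pass, and it exits when the event reports completion instead of waiting for the counter to reach a particular value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(e_));                          \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in kernel: each block spins, then reports one tick.
__global__ void busy_kernel(volatile int *gprogress, long long spin_cycles)
{
    if (threadIdx.x == 0) {
        long long start = clock64();
        while (clock64() - start < spin_cycles) { }  // simulate work
        atomicAdd((int *)gprogress, 1);
        __threadfence_system();
    }
}

int main()
{
    const int blocks = 200;                          // assumed launch size
    volatile int *progress = nullptr, *gprogress = nullptr;

    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));
    CHECK(cudaHostAlloc((void **)&progress, sizeof(int), cudaHostAllocMapped));
    *progress = 0;
    CHECK(cudaHostGetDevicePointer((void **)&gprogress, (void *)progress, 0));

    cudaEvent_t done;
    CHECK(cudaEventCreate(&done));

    busy_kernel<<<blocks, 32>>>(gprogress, 5000000LL);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(done, 0));  // same (default) stream as the kernel

    // Poll: cudaEventQuery returns cudaErrorNotReady until the kernel is done.
    int last = -1;
    cudaError_t st;
    while ((st = cudaEventQuery(done)) == cudaErrorNotReady) {
        int now = *progress;
        if (now != last) {
            printf("progress: %d / %d\n", now, blocks);
            last = now;
        }
    }
    CHECK(st);                        // any other status is a real error

    printf("finished: %d / %d\n", *progress, blocks);
    CHECK(cudaEventDestroy(done));
    CHECK(cudaFreeHost((void *)progress));
    return 0;
}
```

Ending the loop on the event rather than on the counter value means the host cannot spin forever if an update is missed.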

A similar report is here:

@txbob, the link was spot on. I managed to fix the issue by adding a cudaEventQuery() call before querying the variable value.

It appears that this also worked for other people who had the same issue on Windows.

Again, thanks a lot for the pointers!

Hi @Robert_Crovella, this issue came back: the workaround stopped working about a year after our exchange above.

I just recompiled your sample code from Stack Overflow on a Windows machine, and it hung as well. I want to check whether you are still able to run your sample code. If it is supposed to work, I will look into other changes I made related to this progress bar.


As-is, the code really did not work under Windows WDDM, and I think that is discussed in the comments underneath the answer. According to my testing on an RTX 2070 (WDDM), CUDA 10.1, Windows 10, driver 432.00, if I add a cudaEventQuery(stop) where indicated in the now-updated answer and run that code, it seems to run fine for me.

C:\Users\Robert Crovella\source\repos\progress_bar\x64\Release>progress_bar
kernel starting
h_data = 1
h_data = 2
h_data = 3
h_data = 4
h_data = 5
h_data = 6
h_data = 7
h_data = 8
h_data = 9
progress check finished
matrix multiply kernel starting
percent complete = 10.0
percent complete = 20.0
percent complete = 30.0
percent complete = 40.0
percent complete = 50.0
percent complete = 60.0
percent complete = 70.0
percent complete = 80.0
percent complete = 90.0

matrix multiply finished.  elapsed time = 985.443054 milliseconds

Hi @Robert_Crovella, my bad.

Yes, your updated example compiles and runs as expected. In fact, my commit from 2017 also works. For some reason, in 2018 I was convinced that a new driver had broken that commit, so I disabled the patch. Now, after reinstating it, the progress bar works again. I tested on two Windows machines, one with the latest driver, with no problems.

On Linux, the progress bar works well without this patch. Is the patch only needed for the Windows WDDM driver, or do I also need to enable it on macOS?