I am building a CUDA-accelerated program to calculate digits of pi at very large positions (1E9) using the BBP formula. When I run the program using the display GPU for values requiring more than 2 seconds of computation time, my system freezes and the program never completes.
By comparison, if I run the program using my second GPU, it can run for very long times (6+ hours) without the system being affected.
This behavior seems reasonable, but I have been building the application with the intention of utilizing both GPUs. It still functions correctly if the total runtime is under 2 seconds; however, the console (where the program displays a progress ticker) freezes for the duration of that run.
I’d like to reiterate that this issue occurs when running on the display GPU alone and when running on both GPUs, but not when running solely on the secondary GPU.
My system does NOT have an integrated GPU in the CPU, so using that to drive the display is not possible. The GPUs are both Titan X (Pascal) cards (not the Titan Xp), and the CPU is an i9-7900X. I am using Windows 10, and I have already set TdrLevel to 0 in the registry in order to enable the very long runtimes necessary to calculate large digits of pi.
You can look at the code here: GitHub - euphoricpoptarts/bbpCudaImplementation
I haven’t pushed the multi-GPU code, as it can’t be said to work given that I can’t test it on values requiring runtimes of more than 2 seconds.
I have considered splitting the kernel into segments that complete in less than 2 seconds, but I would prefer a different solution if one exists.
Are the GPUs’ async schedulers unable to give the display any access to the GPU? I would be fine with whatever slight reduction in performance that would cause. In addition, is it possible to disable the display for the duration of the execution? I don’t really need to use the PC while this program is running.
What you observe are the consequences of the operating system’s GUI watchdog timer kicking in. This effect is documented, and I think this forum may even have a pinned thread about this very topic.
At any given time, a GPU can either serve the GUI for display purposes or work on a compute kernel. Running lengthy CUDA kernels therefore blocks GUI updates. A blocked GUI makes the system unresponsive to user input/output (resulting in the “freezing” effect), but it is still running. All operating systems supported by CUDA impose a limit on how long they allow the GUI to “hang”; typically it is 2 seconds. If that limit is exceeded, the watchdog kicks in and resets the entire display driver stack, which also destroys the CUDA context.
A best practice is to split GPU work into compute kernels that have run times well under the two second mark. A practical solution is also to never use a display GPU for computation. Many people do this by using a high-end GPU for computation while using a second GPU for display purposes. If they don’t have complex visualization requirements, a low-end GPU (>= $85) will work just fine for driving the display.
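As an aside, you can check programmatically which of your GPUs is subject to a watchdog: the kernelExecTimeoutEnabled field of cudaDeviceProp reports whether a run-time limit is in effect for that device. A minimal sketch:

```
// Sketch: list all CUDA devices and report whether the OS run-time limit
// (watchdog) applies to each one. Devices driving a display under WDDM
// typically report the limit as enabled.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d (%s): run-time limit %s\n", dev, prop.name,
               prop.kernelExecTimeoutEnabled ? "enabled (watchdog applies)" : "disabled");
    }
    return 0;
}
```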
You may be able to manipulate the watchdog timer by either disabling it or dialing in a much longer timeout. The manner of doing so is operating-system dependent; Google is your friend. You mentioned manipulating one specific registry entry on Windows. This may not be sufficient: I vaguely remember that on Windows one needs to change at least two registry entries to effectively disable the watchdog timer. You may also have created the relevant registry entries in the wrong place, as I think the TDR registry entries don’t exist by default? Microsoft gives an overview here:
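For reference, the TDR-related values live under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers and usually have to be created, since they don’t exist by default. A sketch of a .reg file covering the two most commonly adjusted values (double-check the current Microsoft documentation before applying anything, and reboot afterwards):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; TdrLevel = 0 turns TDR detection off entirely
"TdrLevel"=dword:00000000
; TdrDelay = seconds the watchdog waits before triggering (only relevant if TDR stays enabled); 0x3c = 60
"TdrDelay"=dword:0000003c
```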
If you’re still observing a 2 s limit, my guess is that your attempts to modify the registry were not correct. As njuffa points out, it’s OS-specific and a bit arcane in my opinion, as many registry editing adventures can be.
Even with TDR adjustment, my personal experience is that Windows can get somewhat unstable if you run a GPU kernel that takes longer than 10-30 seconds. Therefore, you may wish to consider some of the other methods suggested by njuffa.
Thanks njuffa and txbob. I’m gonna look into the Nsight VSE that txbob mentioned, as I don’t believe I’m currently using it. I think what I ultimately will have to do is what njuffa suggested and split the kernels in my program into smaller chunks with runtimes of half a second or less. My goal for this program is to make it runnable on any Windows 10 system with an arbitrary number of GPUs, and I don’t want to have to tell people who want to use it that they need to edit their registries. I wanted to avoid splitting the kernels because it adds complexity to the program that I consider unnecessary, but it probably won’t be that hard. I’ll make an update on how things turn out.
In that case, the suggestion njuffa gave you is definitely the best and really the only way to go. For general deployments on Windows WDDM GPUs, it’s essential to make sure any CUDA kernel you launch is of short duration. As a rule of thumb, I would aim for no longer than 0.1 seconds per kernel launch, so as to minimize any disruption to GUI usability. A 0.5 second interruption to GUI display responsiveness will be quite noticeable.
I updated my kernel to split the work into chunks that each run in under 0.1 seconds, and I now have a loop that launches each kernel sequentially after the previous one finishes. I can now run the program arbitrarily long without triggering the TDR; however, the desktop is intermittently pretty unresponsive while the program is running.
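In simplified form, the launch loop looks something like this (the kernel and helper names here are placeholders rather than the exact identifiers in my repo):

```
// Sketch only: split the BBP summation into slices small enough that each
// kernel launch finishes well under 0.1 s, then launch them back to back.
void runInChunks(unsigned long long totalTerms, unsigned long long termsPerChunk,
                 unsigned long long digitPosition, unsigned long long* d_partialSums)
{
    const int blocks = 1024, threadsPerBlock = 128;   // tuned per GPU in practice
    for (unsigned long long start = 0; start < totalTerms; start += termsPerChunk) {
        unsigned long long end = start + termsPerChunk;
        if (end > totalTerms) end = totalTerms;
        bbpChunkKernel<<<blocks, threadsPerBlock>>>(d_partialSums, start, end, digitPosition);
        cudaDeviceSynchronize();            // wait for this chunk before issuing the next
        printProgress(end, totalTerms);     // stand-in for the console progress ticker
    }
}
```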
It can best be described as choppy, but I guess this is probably a result of high load on both GPUs limiting how much access the desktop has to GPU resources. I’m gonna assume that there isn’t much that can be done except to reduce the GPU utilization of the kernels that run on the display GPU. Is there some easy way to cap it at about 85-90%, or is it a matter of fine-tuning the number of blocks used by each kernel?
0.1 seconds is the delay limit above which humans will start feeling unresponsiveness, so your new design seems fundamentally sound to me, and I am a bit surprised to hear that you observe noticeable choppiness.
The choppiness may be a result of the launch batching that happens with WDDM drivers. It may also be due to the fact that your app is throwing a gazillion kernel launches at the GPUs, as fast as it can, increasing the latency of GPU work launched by other applications.
Off the top of my head I don’t know of a way to directly throttle GPU use in the manner you desire. Many GPU-accelerated projects that use a volunteer “cloud”, such as Folding @ Home, set the base priority of their Windows clients to “Low”. This presumably improves the chances that processes running at “Normal” base priority get their GPU work launched quickly. The Desktop Window Manager itself already runs at “High” priority.
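If you want to experiment with that, lowering the base priority takes a single Win32 call; a minimal sketch (note this only changes CPU scheduling, so any effect on GPU submission latency is indirect and would have to be measured):

```
// Sketch: lower this process's base priority so other processes, including the
// desktop compositor, get scheduled ahead of it on the CPU.
#include <windows.h>

void lowerProcessPriority() {
    // IDLE_PRIORITY_CLASS shows up as "Low" in Task Manager;
    // BELOW_NORMAL_PRIORITY_CLASS is a milder alternative.
    SetPriorityClass(GetCurrentProcess(), IDLE_PRIORITY_CLASS);
}
```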
I guess you could also experiment with nanosleep in your application, that is, forcing the thread that launches the CUDA kernels into a very brief sleep every N kernel launches.
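There is no nanosleep on Windows, but std::this_thread::sleep_for from C++11 serves the same purpose; roughly something along these lines inside the launch loop (launchCount and launchesPerSleep are tunable placeholders):

```
#include <chrono>
#include <thread>

// Sketch: every N kernel launches, yield the CPU briefly so other clients of
// the driver get a chance to submit their own GPU work.
if (++launchCount % launchesPerSleep == 0)
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
```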
In all likelihood there is no perfect solution as long as the GPU is shared between the GUI and a very compute-intensive application. I run Folding @ Home a lot on a PC with a single GPU, and the GUI is at times a bit slow when I do that, although there is no pronounced choppiness.
Thanks njuffa, that’s an ingenious solution! I just quickly added a brief sleep of 1 millisecond in between kernel launches and that greatly reduced the choppiness of the GUI. I think it’s now at a level of usability I’m comfortable with, although obviously it won’t be able to run anything else GPU-intensive simultaneously. I really appreciate the help!