Needing expert advice…

Hello all,

I keep encountering a really disturbing issue with the use of NVidia’s GPUs for all of my 64-bit Windows GPGPU apps, so I thought that now would be as good a time as any to ask about it.

The issue is twofold.

First, I’ve verified that while the NVidia GPU is executing a kernel, the execution speed of the CPU slows down by a whopping 61 percent!!

Second, while the NVidia GPU is executing a kernel, the Windows GUI is completely unresponsive.

Now, I know what you’re thinking - that this is obviously some kind of bus contention issue, right?

Well, no, actually. That’s what I thought for the longest time too. But the application where this is most painfully obvious is one in which 99.9 percent of the CPU’s execution time is spent processing only its own CPU registers - no memory access required. Furthermore, the single GPU kernel is also spending 99.9 percent of its time processing only its own GPU registers.

So 99.9 percent of the time, both the CPU and GPU are accessing only their own respective registers.

So why should the GPU be so detrimentally interfering with the performance of both the CPU and the Windows GUI?

I can understand if one or more of the GPU’s registers are actually implemented (under the hood) as GPU memory, but that still wouldn’t explain the extreme CPU slowdown, because the CPU, for all intents and purposes, isn’t accessing any memory at all. And I’m not even sure it would adequately explain the detrimental influence on the Windows GUI, because the GPU shouldn’t be using the system bus to access its own memory anyway (right?)…

I also, at one point, thought that the running GPU kernel was clobbering the Windows GUI because the Windows GUI was waiting on DirectX, which was, in turn, waiting on an open slot in the GPU. So I changed the program code to use only three quarters of the available (concurrent) GPU threads. But that didn’t change anything…

This isn’t a WDDM issue either. I’ve turned off the WDDM GPU timeout by setting the following registry values:

HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel = 0, and
HKLM\HARDWARE\DEVICEMAP\VIDEO\MaxObjectNumber = 0,

the latter of which is only necessary for Windows 64-bit, which is what I’m running…
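
(For anyone else trying this, the TdrLevel write can also be done from code; I used regedit myself, so the following is only a rough Win32 sketch - it has to run elevated, and a reboot is needed before the change takes effect:)

#include <windows.h>

/* For reference only: writes TdrLevel = 0 as described above (the MaxObjectNumber
   value is not touched here). Requires an elevated process, and the change only
   takes effect after a reboot.                                                     */
static bool DisableTdrTimeout( void )
{
    HKEY  hKey = NULL;
    DWORD zero = 0;
    if ( RegOpenKeyExW( HKEY_LOCAL_MACHINE,
                        L"SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                        0, KEY_SET_VALUE, &hKey ) != ERROR_SUCCESS )
        return false;

    bool ok = ( RegSetValueExW( hKey, L"TdrLevel", 0, REG_DWORD,
                                (const BYTE *)&zero, sizeof( zero ) ) == ERROR_SUCCESS );
    RegCloseKey( hKey );
    return ok;
}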

Speaking of which, this is what I’m currently using:

Microsoft Windows 7 Professional 64-bit,
Intel CPU: Dual Core “i5” running @ 2.4 GHz
NVidia GPU: GeForce GT 525M
NVidia Optimus: installed
Installed RAM: 8 GB

And the specs for the aforementioned app are:

Windows 64-bit C++ app (w/ some assembler),
Framework used: none (“native” Windows API only),
Compiler: Microsoft Visual C++ 2010, version 10.0.40219.1 SP1Rel,
NVidia Driver: nvcuda.dll, version 337.88,
NVidia Interface used: NVidia Driver API only.

So is there something I can do to either mitigate or circumvent this awful situation? Am I missing some critical piece of information? Should my program be doing something it isn’t? Anyone?

Second, while the NVidia GPU is executing a kernel, the Windows GUI is completely unresponsive.

Indeed. When the GPU is running a kernel, it is unavailable to service any sort of WDDM display functions. This has nothing to do with the size of your kernel, nor with the WDDM TDR timeout. Turning off the TDR timeout simply means that the unresponsiveness can persist for as long as the kernel runs, rather than being summarily terminated by the TDR timeout (and the subsequent reset of the GPU).

First, I’ve verified that while the NVidia GPU is executing a kernel, the execution speed of the CPU slows down by a whopping 61 percent!!

When the CPU has hit some synchronizing instruction after a kernel call (such as cudaMemcpy or cudaDeviceSynchronize), which is a very common programming construct, the CPU (thread) will be effectively doing nothing. Even multithreading techniques will not necessarily cause a yield to productive code, and even if they do, the yielding/polling process necessarily implies some loss of efficiency.

So is there something I can do to either mitigate or circumvent this awful situation? Am I missing some critical piece of information?

To avoid the GPU interfering with WDDM display, making it unresponsive, you can:

  1. make sure your kernels run for a short duration. A kernel that runs for only a few milliseconds (even if many such kernels are launched back-to-back) will not have a major effect on GUI responsiveness (see the sketch after this list).
  2. move the CUDA tasks to a GPU that is not in WDDM mode. On Windows, the TCC option is designed for this. A GPU in TCC mode cannot support a display, is not subject to the WDDM TDR watchdog, and, since it is not servicing any display, will have no impact on GUI responsiveness. (TCC mode is not an option for GeForce GPUs, and since your mobile GPU is servicing the laptop display, you would not want to eliminate the display function anyway.)
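
To give a feel for option 1, here is a rough sketch (driver API; the kernel handle, stream, result buffer, and launch geometry are placeholders, not taken from your code) of replacing one long launch with many short ones, where the kernel takes a chunk index and only processes its slice of the work per launch:

#include <cuda.h>

/* Rough sketch of option 1 (driver API). The kernel handle, stream, result buffer
   and launch geometry are placeholders. The idea is simply to replace one
   multi-second launch with many short ones, passing the chunk index as an extra
   kernel parameter, so that WDDM gets a chance to service the display in between. */
static CUresult LaunchInChunks( CUfunction hKernel, CUstream hStream,
                                CUdeviceptr dResults, unsigned int numChunks )
{
    /* numChunks should be tuned so each launch stays in the few-millisecond range */
    CUresult rc = CUDA_SUCCESS;
    for ( unsigned int chunk = 0; chunk < numChunks && rc == CUDA_SUCCESS; ++chunk )
    {
        void *params[] = { &dResults, &chunk, &numChunks };
        rc = cuLaunchKernel( hKernel,
                             9, 1, 1,        /* placeholder grid  (9 blocks)      */
                             256, 1, 1,      /* placeholder block (256 threads)   */
                             0,              /* no dynamic shared memory          */
                             hStream,
                             params, NULL );
    }
    return rc;   /* the caller can poll with cuStreamQuery() or block with cuStreamSynchronize() */
}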

To avoid the reduction in CPU thread performance due to spinning at synchronization points, do as much CPU work as possible immediately after the kernel call, before any cuda API functions are encountered (especially those that may involve synchronization, such as cudaMemcpy or cudaDeviceSynchronize).
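
In driver API terms (which your spec list says you are using), the ordering being described is roughly this - again just a sketch with placeholder names and sizes:

#include <cstddef>
#include <cuda.h>

extern void DoIndependentCpuWork( void );   /* hypothetical CPU-only work */

/* Sketch of the ordering described above (driver API, placeholder names and sizes):
   queue the asynchronous GPU work first, do the independent CPU work next, and only
   then hit a call that actually waits on the GPU.                                   */
static void OneIteration( CUfunction hKernel, CUstream hStream,
                          CUdeviceptr dBuf, void *hostBuf, size_t nBytes )
{
    void *params[] = { &dBuf };
    cuLaunchKernel( hKernel, 9, 1, 1, 256, 1, 1, 0, hStream, params, NULL );  /* queued; returns at once */
    cuMemcpyDtoHAsync( hostBuf, dBuf, nBytes, hStream );  /* queued behind the kernel; hostBuf should be
                                                             page-locked (cuMemAllocHost) to stay async  */
    DoIndependentCpuWork();           /* the CPU runs concurrently with the GPU here   */
    cuStreamSynchronize( hStream );   /* only here does the host thread actually wait  */
}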

(none of this has anything to do with bus contention)

txbob: Thanks very much for your very informative and timely response. It’s quite reassuring to know that these issues are not as esoteric as I initially thought…

So I can, as you’ve suggested, chop up my kernel into ‘byte-size’ chunks (excuse the pun), but since that makes the kernel, and the code that invokes it, considerably more complicated, I thought I’d hold off on doing that until I verified (by asking you guys, the experts) that there was nothing else that could be done…

However, all of my apps, including this particular one, are written to utilize all available GPUs that can be detected, so I’m thinking now that I could probably get away with ‘chopping up’ the kernel for only those GPUs that are utilized by the display, and using the ‘intact’ kernel (the one I have now) for all others. Since I already have code to detect whether a certain GPU is utilized by the display or not (by using the Driver API’s cuD3D11GetDevice() function), it should still circumvent the Windows GUI problem if/when running on a multiple-GPU machine, right?
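
In case it helps anyone else, the general shape of that detection (heavily simplified here - this is a sketch, not my actual code) is to enumerate the DXGI adapters, treat any adapter with at least one output as display-attached, and map each one to a CUDA device with cuD3D11GetDevice():

#include <d3d11.h>
#include <dxgi.h>
#include <cuda.h>
#include <cudaD3D11.h>

/* Sketch only (not my actual code, and error handling is stripped): an adapter with
   at least one DXGI output is treated as "display-attached", and cuD3D11GetDevice()
   maps it back to a CUDA device for comparison.                                      */
static bool IsDisplayDevice( CUdevice cuDev )
{
    IDXGIFactory1 *pFactory = NULL;
    if ( FAILED( CreateDXGIFactory1( __uuidof( IDXGIFactory1 ), (void **)&pFactory ) ) )
        return false;

    bool isDisplay = false;
    IDXGIAdapter1 *pAdapter = NULL;
    for ( UINT i = 0; !isDisplay && pFactory->EnumAdapters1( i, &pAdapter ) == S_OK; ++i )
    {
        IDXGIOutput *pOutput = NULL;
        bool hasOutput = SUCCEEDED( pAdapter->EnumOutputs( 0, &pOutput ) );
        if ( pOutput ) pOutput->Release();

        CUdevice mapped;
        if ( hasOutput && cuD3D11GetDevice( &mapped, pAdapter ) == CUDA_SUCCESS && mapped == cuDev )
            isDisplay = true;

        pAdapter->Release();
    }
    pFactory->Release();
    return isDisplay;
}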

Speaking of which, I’m not sure that I understand your point pertaining to the CPU. To clarify further, allow me to elucidate the program logic that I currently have, which can be ‘paraphrased’ (using C-like syntax) as the following:

static bool KeepGoing;

do
{   if ( KeepGoing = ReallyLongCPUTask() )  /* <== about 5 seconds, written in assembly, no mem access (to speak of).. */
    {
         if ( cuCtxSetCurrent( ... ) == CUDA_SUCCESS && cuStreamQuery( ... ) == CUDA_SUCCESS )
         {
              if ( KeepGoing = AnyHostBytesZero() )  /* <== really fast check of resultant 'host array' ( uses memchr() ).. */
              {
                   cuMemsetD32Async( ... );     /* <== only sets one 32-bit value in the kernel..                         */
                   cuLaunchKernel( ... );       /* <== launches the << really long >> kernel - no mem access..            */
                   cuMemcpyDtoHAsync( ... );    /* <== occurs post-kernel - copies 2,304 (result) bytes to 'host array'.. */
              }
         }
    }
}
while ( KeepGoing );

I’ve left out all function parameters in the above for readability…
Also, all of the above is done with only one non-zero CUDA Stream per GPU, if that matters…

As you can see from the above, I’m using the cuStreamQuery() function exclusively for synchronization with the host, and that call is non-blocking by definition.

So where is the holdup? As far as I can figure, the cuLaunchKernel() call should return immediately, as should the following cuMemcpyDtoHAsync() call (because they’re both asynchronous), and then ReallyLongCPUTask() gets called, which is where the CPU spends the next five seconds…

So during the execution of the ReallyLongCPUTask() call, the GPU should be executing the kernel… And since the kernel itself does absolutely no thread synchronization (or any other kind of synchronization), what could be slowing down the CPU?

BTW, the execution of the kernel takes way longer than the execution of ReallyLongCPUTask()… That’s mostly because it’s trying to accomplish the same thing as ReallyLongCPUTask(), only it’s doing it for 2,304 different values instead of one…

So as far as I can tell, I’m already doing, as you suggested, “as much CPU work as possible immediately after the kernel call”…

So still mystified about the CPU at this point… Just want to understand either what’s going on, or what I might be doing wrong…

On another note, is there a way to detect whether or not a GPU is in TCC mode or WDDM mode? I don’t remember seeing that in the Driver API, but I haven’t really looked that hard for it…

Anyway, thanks again for helping out. Still some questions, but at least now I have a direction (if not a plan)…

Since I already have code to detect whether a certain GPU is utilized by the display or not (by using the Driver API’s cuD3D11GetDevice() function), it should still circumvent the Windows GUI problem if/when running on a multiple-GPU machine, right?

If you disable the TDR timeout, and a particular GPU is not hosting a display, I would think that the CUDA tasks launched to that GPU should not impact GUI responsiveness, if that is your key complaint. If the GPU is a WDDM GPU, it is subject to WDDM limitations. My recommendation is to use a TCC GPU (on Windows, if possible).

So where is the holdup?

I’m not sure. The usual suggestion at this point would be to profile your code. The profiler will tell you whether any host processes are being held up due to some synchronizing effect of a cuda API call. My statement about synchronization was made without any knowledge of your code. It’s one possible factor, but if it doesn’t apply in your case, then you should use available tools to dissect the execution behavior of your host code to determine where the loss of efficiency is occurring. If it is occurring due to a cuda API call, it should be pretty obvious from profiling results.
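
If firing up the profiler is inconvenient, a low-tech alternative is to timestamp each API call from the host side and see which one the thread is actually stuck in; a minimal sketch (the calls shown in the trailing comment are just illustrative):

#include <windows.h>
#include <cstdio>

/* Low-tech timing: QueryPerformanceCounter, so it works with the VC++ 2010 toolchain. */
static double ElapsedMs( LARGE_INTEGER t0, LARGE_INTEGER t1 )
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency( &freq );
    return 1000.0 * (double)( t1.QuadPart - t0.QuadPart ) / (double)freq.QuadPart;
}

#define TIME_CALL( expr )                                                    \
    do {                                                                     \
        LARGE_INTEGER t0_, t1_;                                              \
        QueryPerformanceCounter( &t0_ );                                     \
        ( expr );                                                            \
        QueryPerformanceCounter( &t1_ );                                     \
        std::printf( "%-30s %9.3f ms\n", #expr, ElapsedMs( t0_, t1_ ) );     \
    } while ( 0 )

/* Usage, inside the existing loop - for example:
       TIME_CALL( cuLaunchKernel( ... ) );
       TIME_CALL( cuMemcpyDtoHAsync( ... ) );
   If either of those shows multiple seconds, that is where the host thread is stuck. */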

On another note, is there a way to detect whether or not a GPU is in TCC mode or WDDM mode?

It can be done with the nvidia-smi tool programmatically, or via the NVML API.
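
Since you are already using the driver API, I believe cuDeviceGetAttribute() with CU_DEVICE_ATTRIBUTE_TCC_DRIVER will also report it; something like:

#include <cuda.h>

/* Sketch: CU_DEVICE_ATTRIBUTE_TCC_DRIVER reports 1 for a device running the TCC
   driver model and 0 under WDDM.                                                 */
static bool IsTccDevice( CUdevice dev )
{
    int tcc = 0;
    if ( cuDeviceGetAttribute( &tcc, CU_DEVICE_ATTRIBUTE_TCC_DRIVER, dev ) != CUDA_SUCCESS )
        return false;
    return ( tcc != 0 );
}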

Thanks for coming through again, txbob.

I too see no reason why I shouldn’t be able to execute long-running kernels on GPUs that aren’t being used by the display. So the plan remains as outlined…

As for why the CPU is slowing to a crawl, I’ll have to write the new (‘chopped up’) code before I can continue testing (and/or profiling)…

But I did have a thought just before I read your latest reply - what if it’s the WDDM that is slowing down the CPU, because it’s busy “spinning” on the GPU, instead of blocking?

That would make a certain (but perverse) kind of sense, because I don’t believe there’s even a way to ‘block’ a thread while waiting on the GPU; the GPU isn’t a valid Windows ‘synchronization object’ (as defined by the Win API’s ‘Synchronization Functions’, such as the WaitForSingleObject() function). So the WDDM would have no choice but to ‘spin’ instead of block, and given its high-priority status, would probably eat up a lot of clock cycles just ‘spinning its wheels’, so to speak…

Just a thought. I’ll know more after I’ve updated the code…
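
If it does turn out that something in my own process is spinning, I notice that cuCtxCreate() accepts scheduling flags that are supposed to control whether the host thread spins, yields, or blocks while waiting on the GPU, so I may experiment with something like this (untested):

#include <cuda.h>

/* Untested sketch: ask the driver to block the host thread on a synchronization
   primitive (rather than spin or yield) whenever it has to wait on the GPU.
   CU_CTX_SCHED_AUTO is the default; SPIN, YIELD and BLOCKING_SYNC are the others. */
static CUresult CreateBlockingContext( CUcontext *pCtx, CUdevice dev )
{
    return cuCtxCreate( pCtx, CU_CTX_SCHED_BLOCKING_SYNC, dev );
}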

Happy computing…