CUDA application crashes occasionally + displays flicker (Windows resets the GPU). Gaining exclusive access?

Hi,

Our application uses CUDA to accelerate some aspects of image processing. It is used in a “semi real time” application.

P5000 on Windows 10 (version 1809).
The WDDM TDR timeout is set to 10 seconds.
A typical compute run takes less than a second.
The application uses ~10 GB of the 16 GB on board (after memory allocation we see less than 1.2 GB free on the device).

Occasionally (rare, but it happens) we get a “CUDA launch failure” error from the GPU, accompanied by the displays going off and on (and the Windows logs show the GPU was reset).
nvidia-smi.exe revealed that Windows is using the GPU extensively (up to 15 additional applications besides our code).
The Windows release we run has no option to disable HW acceleration.

Any idea why this could be happening?
Any way to guarantee exclusive access to the GPU for compute (we drive the displays from the same device in parallel)?

Thanks,
Oren

Adding instrumentation and logging to your application may reveal which portion of the software triggers the timeouts and under which conditions. You may find that kernel execution time varies more widely than anticipated. Have you characterized the kernel(s) of interest with the help of the CUDA profiler? Do you have a roofline performance model for it? That might provide clues.
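
For instance, here is a minimal sketch (the kernel, names, and threshold are all made up, just to show the shape of such instrumentation) of wrapping each launch in CUDA events so that outlier runs get logged long before they ever approach the 10-second TDR limit:

```cpp
// Minimal sketch: time every launch with CUDA events and log outliers.
#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real image-processing work
__global__ void processImage(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void timedLaunch(const float* d_in, float* d_out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    processImage<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // also surfaces execution errors

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms > 2000.0f)                        // well below the 10 s TDR limit
        fprintf(stderr, "WARNING: kernel took %.1f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```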

With the amount of information provided, the possibilities are boundless. Brainstorming in random order:

There is a corner case, overlooked in the design phase, in an iterative part of the kernel code that leads to a significant increase in runtime for certain data because of higher trip counts. The triggering data could be outside the expected range.
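
A toy sketch of what this scenario can look like (not your code, purely illustrative):

```cpp
// An iterative loop whose trip count depends on the data. Most inputs
// converge in a few iterations; an out-of-range value can push the loop
// all the way to maxIter and multiply the kernel's runtime.
__global__ void iterativeRefine(float* data, int n, int maxIter, float tol)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x  = data[i];
    int   it = 0;
    while (fabsf(x) > tol && it < maxIter) {   // data-dependent trip count
        x *= 0.5f;                             // placeholder update step
        ++it;
    }
    data[i] = x;
}
```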

The kernel invokes device functions (including built-in functions provided by CUDA) that have variable execution time based on function argument(s). In a rarely invoked corner case, the code consistently hits the slowest path through those functions, significantly increasing execution time.

The memory access pattern of the kernel may be data dependent, for example, by use of index vectors for indirect addressing. Memory throughput can easily vary by a factor of ten depending on access pattern. Occasionally, the kernel hits a low-throughput case, significantly increasing kernel execution time.
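
Again a toy sketch, purely illustrative:

```cpp
// A gather through an index vector. If idx[] is sorted or clustered,
// neighboring threads load neighboring addresses and the accesses coalesce;
// if idx[] is effectively random, the very same kernel can run roughly an
// order of magnitude slower.
__global__ void gather(const int* idx, const float* src, float* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[idx[i]];   // throughput depends on idx[] contents
}
```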

There may be invalid inputs (such as size information) either in kernel arguments or in launch configurations. This could be due to an uncaught integer overflow in the code that computes launch configurations or the sizes of data structures, or something else.
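
A toy sketch of how such an overflow can sneak in (made-up sizes):

```cpp
// Computing a buffer size in 32-bit int: with width = height = 24000 and
// 4 channels the product exceeds INT_MAX and wraps before it ever reaches
// size_t, so the allocation and any derived launch configuration are
// silently wrong.
#include <cstddef>

size_t imageBytes(int width, int height, int channels)
{
    // BUG: return width * height * channels * sizeof(float);
    //      (the int multiplications overflow before the widening)

    return (size_t)width * height * channels * sizeof(float);  // widen first
}
```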

There could be insufficient CUDA status check coverage, causing (local) data corruption downstream of the missing check, which in turn triggers one of the above cases.
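
A minimal sketch of blanket status checking (the macro name is arbitrary; many codebases use something similar):

```cpp
// Check every runtime API call and every launch, so a failure is reported
// at its source instead of corrupting data and surfacing much later.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage after a kernel launch:
//   myKernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // launch configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // execution errors (debug builds)
```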

If you brainstorm among the people in your development team, you will likely come up with many more potential scenarios.

It looks like a typical description of a kernel time-out.

I assumed that the OP is already aware that they are experiencing a TDR timeout, which is why they have bumped up the timeout limit to 10 seconds, but are looking for possible reasons why that higher limit is exceeded despite the kernel(s) running only about one second on average.

My apologies if I misinterpreted the question.

Yes, that is the way I read your answer also. However, I felt there was a small chance that the OP was asking the question more generally, so I thought it might be useful, for clarity, to be specific in that way.

Your answer is great and describes next steps, once it is stipulated that the core issue is a kernel time-out.

Thank you.

We have extensively characterized the code (well over 1 million runs), both with time measurements and with the CUDA profiler.
The running-time standard deviation is ~30 ms, including the first image processed (which takes longer than later images); overall the spread is +/-100 ms.
This never happens in the testing environment, regardless of how hard we push the code (in terms of parameters).

We are looking for issues in the kernels as well; one hypothesis is that the image data somehow gets corrupted.
Could this be happening due to the infrastructure swapping memory between GPU tasks (ours and others’)? That would be a swap of up to 10 GB…

Moreover, the code (GPU code and launch code) has not changed (not even been recompiled) in the last ~18 months, and yet we got a surge of occurrences on one specific machine. Confusing…

Thanks for the ideas… we will look into them.

Oren

Ideally the testing environment should match the production environment closely. How different are the respective systems? Do they use the same type of GPU, for example? You may need to eliminate differences in hardware and software configuration between the two systems one variable at a time (controlled experiments) to get a better handle on what the bug correlates with.

It is certainly possible to have latent bugs in one’s code that manifest only on certain GPUs, in particular when changing between GPUs of different architectures. These are most often some kind of race condition. Possibly, but rarely, they turn out to be some sort of compiler issue.

Another source of weird bugs can be access to uninitialized memory or out-of-bounds memory accesses on the host. I once debugged a bizarre case where the correct/incorrect behavior of a CUDA-accelerated application depended on the settings of completely unrelated environment variables. After a long search, I finally traced that back to an out-of-bounds access in host code that, by pure chance, happened to land in the application’s environment. Have you tried valgrind on your host code?

Has the application ever worked flawlessly on that machine? If so, what changes in hardware and software configuration have occurred since then? Is this machine deployed in some sort of harsh environment (extended temperature range, high-altitude operation, vibrations from internal or external sources, electromagnetic interference e.g. from electric motors)? Many machines have built-in diagnostics in the BIOS that can be entered during a cold boot. Does the machine pass all these diagnostics in the thorough/detailed/extended mode? Do any system logs show GPU related errors? When monitoring with nvidia-smi are there any signs of GPU overheating?

Is there a possibility of an insufficient power supply? Conservatively, for rock-solid long-term stable operation of a GPU-accelerated system, the sum of the nominal wattage of all system components should not significantly exceed 60% of the nominal wattage of the power supply. The reasons for the rather large margin are (1) the aging of active and passive electronic components over an assumed 5 years of 24/7/365 operation and (2) the short-duration power spikes commonly occurring with both modern CPUs and GPUs.
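
As a worked example with made-up numbers: a 250 W GPU, a 150 W CPU, and roughly 100 W for motherboard, drives, and fans add up to ~500 W nominal, which by that rule of thumb calls for a power supply rated at roughly 850 W rather than the 600 W unit that might look sufficient on paper.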

The application is deployed at multiple installations as well as on in-house systems. It has worked flawlessly for ~18 months now.
We simulate and run on the same HW (Intel server, P5000 GPU).
The direction of power surges / power supplies vs. aging is interesting; some of the HW there is definitely prone to aging.

The system is in an (air-conditioned) cabinet in most installations. Not sure about this specific in-lab unit; we will check that as well.

Thank you for the help. Appreciated very (very!) much.

Oren

If the problem occurs in just one of several identically configured machines, I would try “voodoo maintenance”:

Power down the system, open the case, and carefully remove all accumulated dust. Unplug any auxiliary power connectors, remove the GPU from its PCIe slot, and blow the dust out of the GPU’s (fan-)heatsink assembly; for good measure, also clean any dust adhering to the CPU fan. Visually inspect the PCIe slot for pin damage, corrosion, or lodged dirt. Re-insert the GPU (making sure any mechanical locking mechanisms engage), re-plug the GPU’s auxiliary power connectors, if any (again making sure any locking mechanisms engage), close the system, and power it back up.

Why “voodoo maintenance”? Because it sometimes “magically” fixes hardware-related problems. Dust-clogged fans and air ducts can lead to overheating, and separating then re-inserting connectors is often enough to remove very thin oxidation layers on connector fingers and pins that could interfere with signal integrity.

You might also want to double check that (1) all systems run the same SBIOS version (2) all systems run the same operating system version (3) all systems run the same NVIDIA driver package. And unless operational requirements prohibit it, you would probably want the latest applicable versions of SBIOS/OS/driver installed.
