The CUDA program on my computer has suddenly become unstable (RTX A4000)

The CUDA program on my computer, which has an A4000 graphics card, had been processing images for 5 days without the machine being shut down, and there were no issues during that period. At some point, however, kernel launches started to occasionally take far longer than normal, and the problem has now persisted for two days without a PC restart. The screenshots below show the timing of some CUDA kernels on this machine.

It is not only the kernels from our own program: even a simple CUDA demo kernel consistently exhibits the issue when called repeatedly in a loop (a simplified sketch of that loop is included below the screenshots). I initially suspected high memory usage, overheating of the graphics card, or our use of shared memory, but the problem persists after ruling out these possibilities, and it feels more like something is wrong with the internal state of the graphics card.

We have seen similar behaviour before: the CUDA program would occasionally take a long time to process images, with kernels that normally run in a few tens of milliseconds suddenly jumping to over two hundred milliseconds. With GPU overheating ruled out, this has occurred on multiple A4000 and RTX 4000 cards, but it always recovered after about ten minutes.

What might cause this sudden increase in kernel execution time, and could it be related to the CUDA program that had been running for 5 days? Could it be caused by a memory overflow? Even after I terminated the CUDA program's process, the issue still exists.
[Screenshot: 20241012-180009]

[Screenshot: 20241012-180028]
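
For reference, here is a simplified sketch of the kind of demo loop I am timing. The kernel body, buffer size, and names here are illustrative only, not our production code; the real test program differs apart from timing each launch with CUDA events and the Sleep(100) between launches. Even a trivial loop like this shows occasional iterations that take far longer than the rest on the affected machine.

```cpp
// Simplified sketch of the demo timing loop (illustrative only): a trivial
// kernel is launched repeatedly and each launch is timed with CUDA events.
#include <cstdio>
#include <windows.h>        // Sleep()
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 1.0001f + 0.5f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < 1000; ++iter) {
        cudaEventRecord(start);
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("iter %4d: %.3f ms\n", iter, ms);

        Sleep(100);         // pause between launches, as in the actual demo
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```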

Have you tried TCC driver mode instead?

I second the recommendation to try the TCC driver.

However, in my experience jitter observed when using the WDDM driver has never been on the order of 90 milliseconds as is seen here. While the operating system controls GPU memory allocation when WDDM is used, any delays caused by that should affect the execution time of allocation functions in the CUDA API, not the execution time of launched kernels.

If this is a system with dual CPUs, it might also be interesting to try fixing processor and memory affinity. On Linux one would do this with a tool like numactl; although I use Windows regularly, I don’t know what the equivalent control mechanism is. However, this experiment could be seen as even more of a grasping at straws than switching to the TCC driver.

One effect that can cause significant differences in kernel execution time is the dynamic clocking of GPUs, where transitions between power states and clock-boosting steps are not instantaneous but occur with a certain hysteresis. Depending on the cadence and duration of kernels, a kernel launch could therefore occur in a “near idle” state or a “full power” state. Since in this case the kernels are issued in a tight loop, this should not apply here.

There are other throttling reasons besides thermal throttling. It might be useful to use a tool like GPU-Z to continuously monitor all PerfCap reasons. I recall one case reported in this forum of a defect in one of the auxiliary power connectors of a GPU, so that the GPU would intermittently lose roughly half its power supply, causing severe throttling.

Thank you. The performance fluctuation is somewhat peculiar. Initially there were no issues; the problem only appeared after our software’s CUDA image-processing program had been running continuously for five days. There have been similar occurrences in the past: after several days of continuous use without shutting down the machine, there was a low probability of performance fluctuations lasting from several minutes to an hour, with computations that should take about 30 ms spiking to over 200 ms. On one occasion I recorded the temperature while the issue was occurring, and it was not high, only around 40 degrees Celsius.

My concern is that a memory overflow or similar bug in our CUDA code might be putting the NVIDIA driver into a bad internal state and causing these fluctuations. The phenomenon now occurs consistently on my PC, so I am hesitant to restart the machine while it can still be investigated. I would appreciate help in determining whether such a bug in our CUDA code could produce this behaviour. I have reviewed a large amount of our code and have not yet found any issues. After running the CUDA demo program hundreds of times, the slow iterations seem to follow a somewhat periodic pattern, while GPU utilization remains relatively stable even when the GPU is otherwise idle.
[Screenshot attachment]

Is this system just being used to run your image processing task?

I ask because the nvidia-smi output in the first post shows a number of other programmes using GPU resources, and GPU demands from them could well have an effect on this.

Not yet; my PC doesn’t have integrated graphics, so I’m using WDDM mode.

Thank you, but my PC doesn’t have integrated graphics. I’d like to ask whether a memory overflow in the CUDA program could be the cause of this phenomenon. In the past I’ve experienced performance jitter due to memory overflow, but it was much more severe, causing stuttering for over a second, and continued operation could lead to CUDA errors. This phenomenon, however, has not produced any CUDA errors; it is just a sustained period of degraded performance.

@rs277 @njuffa @Curefab This device is a development and testing machine for our software. It has no integrated graphics and uses an NVIDIA RTX A4000 graphics card. The machines in our production environment have similar configurations, also without integrated graphics, and mostly use RTX 4000 or RTX A4000 cards. The issue seen in production is that after running for a few days there is a sustained period of performance degradation, lasting several minutes or even tens of minutes, before things return to normal. We had never been able to reproduce the problem until the other day, when it occurred on our development and testing machine, again after a kernel had been running for some time; this time the performance degradation has not returned to normal.

I don’t know what is meant by “memory overflow”.

Failures in CUDA memory allocation functions should be caught by the application, with abnormal termination likely the most appropriate response. Does the application have appropriate checks in place?
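
As an illustration only (the macro name and the choice to terminate immediately on error are arbitrary and should be adapted to your application), a minimal checking pattern could look like this:

```cpp
// Sketch of a minimal CUDA error-checking pattern; adapt the error handling
// (logging, cleanup, termination policy) to the application's needs.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

int main()
{
    float *d_buf = nullptr;
    // An allocation failure is reported here instead of surfacing later
    // as mysterious behaviour in unrelated parts of the program.
    CUDA_CHECK(cudaMalloc(&d_buf, 256u * 1024u * 1024u));

    // ... launch kernels here, then pick up asynchronous errors:
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```

With checks like this in place, an allocation failure becomes immediately visible instead of silently leading to undefined behaviour later.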

Are you referring to operating system initiated swapping of system memory to backing storage by any chance? You can continuously monitor GPU and system memory usage with the free GPU-Z tool from TechPowerUp that I mentioned earlier.

The CUDA software stack competes with other software in the system for certain resources, in particular the CPU, system memory, and PCIe. Any (intensive) use of these resources by non-CUDA software can have a negative impact on CUDA performance. Have you conducted a controlled experiment where the system is completely idle (including background tasks such as Windows telemetry and antivirus checks) other than your image processing application?

You don’t need a system with integrated graphics. Any discrete second GPU in the system (this can be a low-end GPU) can serve the operating system’s GUI, freeing the compute GPU to be used with the TCC driver, provided that GPU has TCC support. The A4000 should certainly have such support.

[Later:]

I noticed belatedly that the test program in the original post contains a call to Sleep(100). Does the problem of significantly varying kernel execution times disappear when this is removed? This is a follow-up to my hypothesis that the execution time differences are a side effect of dynamic GPU clocking.

A GPU not being used for a certain amount of time will fall into power-save mode, typically running at something like 300 MHz and with minimal PCIe bandwidth. Subsequent short-time usage of the GPU may be too short to cause power state and clock boosts to take effect before the kernel ends (the hysteresis effect I mentioned).
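
One way to check whether this is what is happening is to log the current SM clock next to each timing sample. Below is a rough sketch using NVML (assuming the NVML header and library that ship with the driver / CUDA toolkit are available on the machine; in practice the query would be integrated into the timing loop rather than run standalone):

```cpp
// Sketch: read the current SM clock via NVML so it can be logged alongside
// each kernel timing sample. Link against nvml.lib (Windows) or -lnvidia-ml.
#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        unsigned int smClockMHz = 0;
        if (nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClockMHz) == NVML_SUCCESS) {
            // A value near the boost clock indicates a "full power" state;
            // a few hundred MHz indicates the GPU is still in a low-power state.
            printf("current SM clock: %u MHz\n", smClockMHz);
        }
    }

    nvmlShutdown();
    return 0;
}
```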

If this turns out to be the immediate cause of the observations, you can try to lock the GPU clocks with nvidia-smi --lock-gpu-clocks (note that this may lead to significantly increased power draw / energy usage).