Nvlddmkm issue leading to BSOD during AI inference

On a server, I have two NVIDIA L4 GPUs that are used for AI model inference, performing anomaly detection on image data. Since August 22, 2025, a weird issue has been happening: the server reboots at random times in the middle of an inference job, with no noticeable increase in temperature or memory utilization. On another machine with a Quadro RTX 4000 GPU I have never seen this, and even on weaker laptop GPUs I have no such issue.
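
For context, GPU temperature and memory were being sampled during the runs with something along these lines (a minimal sketch; the polling interval and log file name are arbitrary examples, not my exact script):

```python
import csv
import subprocess
import time

# Poll nvidia-smi every few seconds and append one row per GPU to a CSV file:
# timestamp, GPU index, temperature (C), memory used/total (MiB), utilization (%).
QUERY = "timestamp,index,temperature.gpu,memory.used,memory.total,utilization.gpu"

with open("gpu_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.strip().splitlines():
            writer.writerow([field.strip() for field in line.split(",")])
        f.flush()
        time.sleep(5)
```

The last samples before each reboot show both L4s well within normal temperature and memory ranges.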

When I tried to find the root cause, every sudden reboot (BSOD) pointed to the nvlddmkm.sys module, which is part of the NVIDIA GPU driver. A fault originating in this module apparently leads to critical Event ID 41 on Windows Server 2025 (source: Microsoft-Windows-Kernel-Power).
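
This is how the crashes show up in the System log (a sketch of the kind of query I ran; wevtutil is the stock Windows tool, and the XPath filter just selects Kernel-Power Event ID 41 entries):

```python
import subprocess

# Pull the most recent Kernel-Power Event ID 41 entries (unexpected shutdowns)
# from the System log, newest first. Run from an elevated prompt on the server.
xpath = "*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] and (EventID=41)]]"
result = subprocess.run(
    ["wevtutil", "qe", "System", f"/q:{xpath}", "/f:text", "/c:10", "/rd:true"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # each entry's timestamp matches one of the random reboots
```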

Following suggestions from online forums, I tried rolling back the driver and the CUDA toolkit, along with several other fixes, but nothing helped. I suspect this may have something to do with incompatibilities between Windows Server 2025, WSL 2, Docker Desktop, and the NVIDIA datacenter (Tesla) GPU drivers, but then I'm not sure why the problem only appeared on August 22.
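
For anyone who asks about versions: a quick way to dump what the workload actually sees (and to spot host vs. WSL 2 vs. container mismatches) is something like the sketch below, assuming the nvidia-ml-py package is installed in that environment:

```python
import pynvml  # pip install nvidia-ml-py

# Report the driver / CUDA driver versions and the GPUs visible to this runtime.
pynvml.nvmlInit()
try:
    print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
    cuda = pynvml.nvmlSystemGetCudaDriverVersion()
    print("CUDA driver version:", f"{cuda // 1000}.{(cuda % 1000) // 10}")
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        print(f"GPU {i}:", pynvml.nvmlDeviceGetName(handle))
finally:
    pynvml.nvmlShutdown()
```

I can post the exact versions reported on the host, inside WSL 2, and inside the container if that helps narrow things down.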

Any thoughts, or similar scenarios that led to a solution, would be really appreciated.