I’m training models on a Dell Precision 5860 tower with an Nvidia RTX A5000 card. I’ve occasionally had Windows updates affect my driver so that TensorFlow stops seeing my GPU; a driver or cuDNN update fixes it, but it has been happening more frequently. Currently, TensorFlow sees my GPU, loads images, and builds models using shared memory, then my CPU ramps to 100% and the training is killed about a minute into the first epoch. The shared-memory usage looks consistent with what I saw when the system was working. I had a brief GPU temperature spike about a month ago, but it hasn’t recurred. Windows Reliability Monitor has logged KillKernel errors 193 and 141. I’m required to use Windows, so TensorFlow, CUDA, and cuDNN are installed through WSL2 while the driver is installed natively on Windows.
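For reference, this is roughly the kind of minimal check I mean when I say TensorFlow sees my GPU (standard TF API calls, nothing custom); it should also show whether ops quietly end up on the CPU once training starts:

```python
import tensorflow as tf

# Sanity check: the RTX A5000 should show up here
print("GPUs visible:", tf.config.list_physical_devices("GPU"))

# Log which device each op is placed on; if everything prints /CPU:0
# during training, TF has silently fallen back to the CPU
tf.debugging.set_log_device_placement(True)

# Allocate GPU memory on demand instead of reserving it all up front
# (must be set before the GPU is first used)
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```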
System Params:
Windows 11
WSL2
Python 3.12.2
Troubleshooting:
- Updated drivers and reinstalled the current TF version.
- Tried TensorFlow 2.19, 2.18, 2.17, 2.16, and 2.14, each paired with the CUDA and cuDNN versions listed in the tested-configurations table at Build from source | TensorFlow (a quick version check is sketched after this list). nvidia-smi confirms the right driver is installed and nvcc -V confirms the CUDA version is correct. TF is installed with pip install tensorflow[and-cuda]==v.v.v
- Tried installing CUDA and cuDNN both via sudo apt-get (in WSL) and through a custom Windows-native install, deselecting the driver bundled with CUDA so the latest driver is actually the one in use.
- Tried different drivers with each combination of TF, CUDA, and cuDNN: 573.42, 573.48, 576.52, and 576.02.
- Removed my GPU, cleaned it, and reinstalled it.
- Reimaged the desktop in case the Windows KillKernel errors referred to hardware or drivers other than my Nvidia card.
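For completeness, here is a minimal sketch of how I check from inside Python which CUDA and cuDNN versions the installed wheel was built against (via TF's tf.sysconfig.get_build_info()), to compare with what nvidia-smi and nvcc -V report:

```python
import tensorflow as tf

# Report the CUDA / cuDNN versions the installed TF wheel was built against,
# to compare with the driver from nvidia-smi and the toolkit from nvcc -V
info = tf.sysconfig.get_build_info()
print("TF version:         ", tf.__version__)
print("CUDA build version: ", info.get("cuda_version"))
print("cuDNN build version:", info.get("cudnn_version"))
```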
Other than using the warranty on my desktop, I have no idea what else to try. Thanks in advance for any help.