OS: CentOS Stream 8
Kernel: 4.18.0-522.el8.x86_64
CPUs: 128 processors
Memory: 1 TB
Driver version: 545.23.08
CUDA Version: 12.3
Python version: 3.11.5
tensorflow-gpu: 2.4.1
PyTorch: 1.4.0
Issue: When training a model, GPU memory is allocated and GPU utilization is at 100%, as expected. After the server has been up for a few weeks, training jobs still allocate GPU memory, but GPU utilization stays at 0%. Restarting the GPU server fixes the issue.
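The stalled state described above (memory allocated, utilization at 0%) could be detected with something like the following. This is only a minimal sketch: it assumes `nvidia-smi` is on the PATH, and the function names and the 1024 MiB threshold are hypothetical choices for illustration, not part of any library.

```python
import subprocess

# nvidia-smi query: one CSV line per GPU, e.g. "0, 100, 30500"
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Parse 'index, util, mem' CSV lines into a list of dicts."""
    stats = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        try:
            index, util, mem = (int(p) for p in parts)
        except ValueError:
            continue  # skip fields that are not plain integers, e.g. "[N/A]"
        stats.append({"index": index, "util_pct": util, "mem_used_mib": mem})
    return stats


def stalled_gpus(stats, min_mem_mib=1024):
    """Return indices of GPUs holding memory while reporting 0% utilization.

    min_mem_mib is an assumed threshold for "memory is allocated".
    """
    return [
        s["index"]
        for s in stats
        if s["util_pct"] == 0 and s["mem_used_mib"] >= min_mem_mib
    ]


if __name__ == "__main__":
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    print("Possibly stalled GPUs:", stalled_gpus(parse_gpu_stats(out)))
```

A single reading of 0% can also be a momentary gap between batches, so a check like this is more meaningful when repeated over several samples.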
How can I fix this issue?
Thanks.