H100 GPU memory allocated but GPU usage 0%

OS: CentOS Stream 8
Kernel: 4.18.0-522.el8.x86_64
CPUs: 128 processors
Memory: 1 TB
Driver version: 545.23.08
CUDA Version: 12.3

python version: 3.11.5
tensorflow-gpu: 2.4.1
pytorch: 1.4.0

Issue: Model training initially ran with GPU memory allocated and GPU usage at 100%. After a few weeks, training still allocates GPU memory, but GPU usage stays at 0%. Restarting the GPU server fixes the issue.
How could I fix the issue?

Restart the server.

Kill all the processes that your GPU jobs started. Leftover processes can keep GPU memory allocated even when they are no longer doing any work.
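To find which processes to kill, something like the following might help. It is only a rough sketch: it assumes `nvidia-smi` is on the PATH and that the stale processes still show up in its compute-apps list, which is not guaranteed. The function names `parse_compute_apps` and `list_gpu_processes` are made up for this example. It only prints the PIDs and does not kill anything.

```python
import csv
import io
import subprocess

# nvidia-smi query that lists processes currently holding GPU memory.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV output into (pid, name, used_mib) tuples.

    used_mib is None when nvidia-smi reports a non-numeric value.
    """
    apps = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        pid, name, used = (field.strip() for field in row)
        if not pid.isdigit():
            continue
        apps.append((int(pid), name, int(used) if used.isdigit() else None))
    return apps


def list_gpu_processes():
    """Run nvidia-smi and return the parsed process list."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    try:
        for pid, name, used_mib in list_gpu_processes():
            # Review each PID by hand before running `kill <pid>`.
            print(f"pid={pid} name={name} used_memory_mib={used_mib}")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"could not query nvidia-smi: {exc}")
```

If the list is empty but memory is still allocated, `fuser -v /dev/nvidia*` (run as root) may show processes that still hold the device files open.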

This question has been asked in many places; a bit of searching will turn up other suggestions.