Failed call to cuInit

  1. System: CentOS Stream 8
  2. 128 CPUs, 1 TB memory, 2x H100
  3. CUDA Driver: 560.35.03, CUDA Version: 12.6

The server had been running for a few weeks when I could no longer access the GPUs. nvidia-smi still works fine, but torch reports "CUDA driver initialization failed." and tensorflow reports "failed call to cuInit".

Rebooting the server fixes the issue, but it is a shared server and I can't ask IT to reboot it because other users' jobs are running on it.
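
Since torch and tensorflow both hide the underlying driver error, it may help to call cuInit directly and read the raw return code. The following is only a minimal sketch: it assumes the driver library can be loaded as `libcuda.so.1` (the usual Linux name), and `check_cuinit` is a hypothetical helper name, not part of any library. It uses the CUDA Driver API functions `cuInit`, `cuGetErrorName` and `cuGetErrorString` through `ctypes`.

```python
import ctypes


def check_cuinit(lib=None):
    """Call cuInit(0) and return (code, error_name, error_description)."""
    if lib is None:
        # Assumption: the NVIDIA driver library is installed under this name.
        lib = ctypes.CDLL("libcuda.so.1")

    # cuInit takes a flags argument that must be 0 and returns a CUresult.
    code = lib.cuInit(0)

    # Both lookups write a pointer to a static string into the second argument.
    name = ctypes.c_char_p()
    desc = ctypes.c_char_p()
    lib.cuGetErrorName(code, ctypes.byref(name))
    lib.cuGetErrorString(code, ctypes.byref(desc))

    # The pointers stay NULL if the driver does not recognise the code.
    return code, (name.value or b"").decode(), (desc.value or b"").decode()


if __name__ == "__main__":
    try:
        code, name, desc = check_cuinit()
        print(f"cuInit returned {code}: {name} ({desc})")
    except OSError as exc:
        # Raised when libcuda.so.1 cannot be found or loaded at all.
        print(f"could not load the CUDA driver library: {exc}")
```

Running this as the same user and in the same environment where torch fails should show the exact CUresult name, which is usually easier to search for than the framework-level message.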