I am training a deep learning model on the GPU and the training gets stuck after about 20 minutes; when the error occurs I also cannot access the GPU with nvidia-smi. After rebooting, nvidia-smi works again without error, but when I run the training program the problem happens again after about 20 minutes of training. I have used the same program to train models for many hours without any error before, so this is very annoying and weird.
Possible root causes and remedies I have found for this error (the first two can be checked with the logging sketch after this list):
- Overheating
- Insufficient/unstable power supply
- Moving the GPU to a different PCIe slot
- Updating the system BIOS
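This is roughly how I log GPU readings to a file during training, so the last values before a crash show whether the card was running hot or drawing unusual power (a minimal sketch; the query fields, the one-second interval, and the gpu_telemetry.csv file name are just my own choices):

  # Append temperature, power draw, clocks, P-state and utilization once per second
  nvidia-smi \
    --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm,clocks.mem,pstate,utilization.gpu \
    --format=csv -l 1 >> gpu_telemetry.csv

Looking at the tail of this file after a crash makes it easy to rule overheating in or out and to spot sudden clock or power spikes.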
I monitored the temperatures during training and they stayed below 50 °C, so it is not overheating. I also tried enabling persistence mode, but that did not help. Finally I applied the solution from "Unable to determine the device handle for GPU xxxxxxxx: Unknown Error", using the command below as a temporary workaround; the training program then ran stably for about 30 minutes before the error happened again. How can I fix the root cause completely?
Temporary workaround:
nvidia-smi -lgc 300,1500
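As I understand it, -lgc only locks the graphics clock to the given min,max range in MHz (here 300-1500), which limits sudden clock and voltage transients but does not fix whatever makes the GPU drop off the bus. For reference, this is how I verify and undo the lock (a sketch only; needs root privileges):

  # Show the currently applied clock limits
  nvidia-smi -q -d CLOCK

  # Reset the graphics clocks back to their defaults when done testing
  sudo nvidia-smi -rgc

The lock does not survive a reboot, so it has to be reapplied each time.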
Some useful information:

- GPU and driver:
  - Driver: 515.76
  - GPU: NVIDIA GeForce RTX 3060
  - CUDA: 11.7
  - TensorFlow: 2.10.0
- dmesg -T (see the PCIe check sketch after this list):
  [Sun Oct 16 11:45:40 2022] NVRM: GPU at PCI:0000:01:00: GPU-903cc954-07f3-f490-e3d4-7e79bffaa22f
  [Sun Oct 16 11:45:40 2022] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
  [Sun Oct 16 11:45:40 2022] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
  [Sun Oct 16 11:45:40 2022] NVRM: A GPU crash dump has been created. If possible, please run
  NVRM: nvidia-bug-report.sh as root to collect this data before
  NVRM: the NVIDIA kernel module is unloaded.
- nvidia-debugdump --list:
  Error: nvmlDeviceGetHandleByIndex(): Unknown Error
  FAILED to get details on GPU (0x0): Unknown Error
- nvidia-smi:
  Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error
- Bug report: nvidia-bug-report.log.gz (127.3 KB)
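Since Xid 79 ("GPU has fallen off the bus") usually points to a PCIe link or power delivery problem rather than a software bug, I am also checking the PCIe link state and the kernel's ASPM power-management policy (a rough sketch; 01:00.0 is my GPU's PCI address):

  # Negotiated PCIe link speed and width for the GPU
  sudo lspci -vv -s 01:00.0 | grep -iE 'LnkCap|LnkSta'

  # Current PCIe ASPM policy of the kernel
  cat /sys/module/pcie_aspm/parameters/policy

If the link reports a downgraded speed or width, or the error goes away with ASPM disabled (e.g. booting with the pcie_aspm=off kernel parameter), that would point at the slot, riser, or power delivery rather than the driver.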