GPU is lost while training

Hi sir

We have a GPU machine with eight 3060 GPUs, and for some reason the system stops detecting one of them while training.
Sometimes the Python application cannot terminate normally.
I have replaced the GPU and the issue still exists.
Here is the log:
nvidia-bug-report.log.gz (63.1 KB)
BTW, the machine is powered by two separate 16 A PDUs with no other devices attached.
Log after training:
nvidia-bug-report.log(1) (4.6 MB)

Unfortunately, the first log is incomplete and the second one is inaccessible. Judging by the first log, nvidia-persistenced is not running. Please make sure it starts on boot and runs continuously. Please also monitor GPU temperatures while training, e.g. using
nvidia-smi -q -l 2 -d TEMPERATURE (2.9 MB)
Sir, I changed the format of the log file and uploaded it again; you should be able to open it now.
These days, a card is lost in roughly half of the training runs.

You really need to set up nvidia-persistenced correctly. The temperatures at idle seem fine; you'll have to monitor them while training, though.
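For anyone following along, here is a minimal sketch of how temperatures could be logged during a training run while also noting when a GPU stops reporting. It assumes nvidia-smi is on the PATH and that the machine should expose eight GPUs, as described in this thread. The function names (parse_temperatures, missing_gpus, monitor) and the 2-second interval are illustrative choices, not something taken from the logs.

```python
import subprocess
import time

# Real nvidia-smi options: per-GPU index and core temperature as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Parse 'index, temperature' lines into {gpu_index: temp_celsius}.

    Lines that cannot be parsed (for example an error string printed
    in place of a temperature) are skipped.
    """
    temps = {}
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        try:
            temps[int(parts[0])] = int(parts[1])
        except ValueError:
            continue
    return temps


def missing_gpus(temps, expected_count):
    """Return the indices in range(expected_count) that did not report."""
    return [i for i in range(expected_count) if i not in temps]


def monitor(expected_count=8, interval_s=2.0):
    """Print a timestamped temperature snapshot until interrupted."""
    while True:
        try:
            result = subprocess.run(
                QUERY, capture_output=True, text=True, timeout=30
            )
            temps = parse_temperatures(result.stdout)
        except subprocess.TimeoutExpired:
            # nvidia-smi can hang once a GPU is in a bad state.
            temps = {}
        lost = missing_gpus(temps, expected_count)
        print(time.strftime("%H:%M:%S"), temps, "missing:", lost, flush=True)
        time.sleep(interval_s)
```

Calling monitor() in a second terminal while the training job runs, with the output redirected to a file, would give a record of the last temperatures seen before a card disappears.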


nvidia-bug-report.log.gz (2.3 MB)
Sir, the temperatures during training are fine, and I have set up nvidia-persistenced.
Here is the new log. Thank you!
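
One way to confirm that the persistence set-up took effect is to ask the driver for the persistence mode of each GPU. The sketch below assumes nvidia-smi is on the PATH; the helper names parse_persistence and check_persistence are hypothetical.

```python
import subprocess


def parse_persistence(csv_text):
    """Map GPU index -> True when persistence mode is reported as Enabled."""
    modes = {}
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2 and parts[0].isdigit():
            modes[int(parts[0])] = parts[1] == "Enabled"
    return modes


def check_persistence():
    """Query every visible GPU and return its persistence state."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,persistence_mode",
         "--format=csv,noheader"],
        capture_output=True, text=True, timeout=30,
    )
    return parse_persistence(result.stdout)
```

If check_persistence() returns False for any index, or returns fewer entries than the number of installed cards, the daemon is probably not covering every GPU. On systemd-based distributions the service is typically named nvidia-persistenced, though that depends on how the driver was packaged.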
