We have a GPU machine with eight RTX 3060 GPUs, and for some reason the system stops detecting one of them during training.
Sometimes the Python app can't terminate normally.
I have swapped out the GPU and the issue still exists.
Here is the log: nvidia-bug-report.log.gz (63.1 KB)
BTW, the machine is powered by two separate 16A PDUs with no other devices attached.
Log after training: nvidia-bug-report.log(1) (4.6 MB)
Unfortunately, the first log is incomplete and the second one is inaccessible. Judging by the first log, nvidia-persistenced is not running. Please make sure it starts on boot and keeps running. Please also monitor GPU temperatures while training, e.g. using
nvidia-smi -q -l 2 -d TEMPERATURE
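If watching that output by hand during a long run is impractical, a small script can poll the temperatures and flag a GPU that disappears. This is only a rough sketch, not an official tool: it assumes the CSV query form of nvidia-smi (`--query-gpu=index,temperature.gpu --format=csv,noheader,nounits`), and the function names, the expected count of 8 GPUs, and the 85 °C warning threshold are illustrative assumptions, not values taken from the logs.

```python
import subprocess

# Assumed query: one "index, temperature" line per visible GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temps(csv_text):
    """Parse 'index, temperature' lines into {gpu_index: temp_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue  # skip error messages or unexpected output
        try:
            temps[int(parts[0])] = int(parts[1])
        except ValueError:
            continue  # skip non-numeric fields such as "[N/A]"
    return temps


def check(temps, expected_gpus=8, warn_celsius=85):
    """Return (missing_indices, hot_indices); both thresholds are assumptions."""
    missing = sorted(set(range(expected_gpus)) - set(temps))
    hot = sorted(i for i, t in temps.items() if t >= warn_celsius)
    return missing, hot


def sample():
    """Run nvidia-smi once and return the parsed temperatures."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_temps(result.stdout)
```

Calling `check(sample())` every couple of seconds from a loop and logging the result with a timestamp would show whether temperatures climb before a card drops out. Note that the driver may renumber the remaining GPUs after one is lost, so the number of missing entries is more meaningful than which index is reported missing.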
nvidia-bug-report.zip (2.9 MB)
I changed the format of the log file and uploaded it again; you should be able to open it now.
These days, a card drops out in roughly half of the training runs.