I am trying to train a machine learning model using TensorFlow on my Ubuntu 20.04 server, which has CUDA 11.2 and cuDNN 8.1 installed. Unfortunately, the GPU crashes and falls off the bus, as can be seen in the output of the dmesg command:
[ 517.195242] NVRM: GPU at PCI:0000:0a:00: GPU-7a2f2bd6-a848-bf8e-0541-09ef347fba71
[ 517.195246] NVRM: GPU Board Serial Number: 1322721012372
[ 517.195248] NVRM: Xid (PCI:0000:0a:00): 79, pid=0, GPU has fallen off the bus.
[ 517.195274] NVRM: GPU 0000:0a:00.0: GPU has fallen off the bus.
[ 517.195276] NVRM: GPU 0000:0a:00.0: GPU is on Board 1322721012372.
[ 517.195290] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
In my debugging attempts, I manually ruled out the following causes:
- a lack of power, by limiting the power draw to 250 W via nvidia-smi -pl 250
- overheating, by monitoring the temperature via nvidia-smi --query-gpu=timestamp,temperature.gpu --format=csv, which never crossed 80 °C (see the logging sketch after this list)
- an out-of-memory error on the GPU, via nvidia-smi --query-gpu=timestamp,memory.free --format=csv, which never dropped below 600 MB of free memory
- a problem with my system RAM, by running memtester multiple times
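For completeness, the temperature and memory readings were collected with a command along the following lines (a sketch only; the 5-second interval, the extra power.draw and memory.used fields, and the output file name gpu_log.csv are illustrative choices, not exactly what I originally ran):
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,memory.used,memory.free --format=csv -l 5 > gpu_log.csv
This keeps sampling until the crash, so the last lines of gpu_log.csv show the temperature, power draw, and free memory right before the GPU fell off the bus.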
What could be the reason for the GPU falling off the bus? To me this looks like a hardware problem, but is that the right conclusion?
nvidia-bug-report.log.gz (262.8 KB)