Six weeks ago we built a PC with an RTX 4090 card.
At seemingly random times, nvidia-smi starts returning:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Shutting down and powering back on doesn't resolve it; reinstalling the driver is what brings the card back for us.
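In case it helps, this is roughly the kind of check we run before reinstalling (a minimal sketch, not an exact transcript of what we typed):

nvidia-smi                       # fails with the error above
lsmod | grep nvidia              # see whether the kernel modules are still loaded
dmesg | grep -i nvidia           # look for driver errors / Xid messages
dkms status                      # see whether the module is built for the running kernel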
We run our model training in Docker using the NVIDIA NGC TensorFlow container, following the official instructions, so I think it's fair to call it a pretty simple setup.
In short: NVIDIA driver, Docker, NGC container, run our code.
We have tried installing the driver both via apt and via the NVIDIA-Linux-x86_64-525.105.17.run installer.
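Roughly the two install paths we tried (sketch only; the exact apt package name on our system may differ slightly):

sudo apt install nvidia-driver-525             # distro/apt route
sudo sh NVIDIA-Linux-x86_64-525.105.17.run     # NVIDIA .run installer route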
We used docker pull nvcr.io/nvidia/tensorflow:23.03-tf1-py3, and have also tried a previous container version.
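The container is launched more or less like this (a minimal sketch; the mount path below is a placeholder and our real command may include additional mounts/flags):

docker run --gpus all -it --rm \
    -v /path/to/our/code:/workspace/code \
    nvcr.io/nvidia/tensorflow:23.03-tf1-py3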
Attached is the debug log.
nvidia-bug-report.log.gz (163.0 KB)
Can someone advise how we can address this, or at least how to better narrow down the issue?
PC:
ASUS TUF Gaming GeForce RTX 4090 OC Edition (PCIe 4.0, 24GB)
GIGABYTE Z690 AORUS ELITE AX DDR4 LGA 1700 Intel Z690 ATX Motherboard with DDR4
Intel Core i7-12700K Desktop Processor, 12 cores (8P+4E)
Seasonic VERTEX GX-1200, 1200W 80+ Gold, ATX 3.0/ PCIe 5.0 Compliant
Kingston FURY Beast 64GB DDR4 3600MHz
WD_BLACK SN850X 2TB PCIe Gen4 NVMe