Unable to determine the device handle for GPU0000:01:00.0: Unknown Error

OS: Ubuntu 22.04
Driver Version: NVIDIA driver metapackage nvidia-driver-535
GPUs: 2x RTX 3090

When I train the llama2-13b model on both GPUs, the GPU in the PCIe1 slot seems to drop out.

When I then check it with nvidia-smi, I get:
“Unable to determine the device handle for GPU0000:01:00.0: Unknown Error”
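
For anyone trying to reproduce this: when nvidia-smi reports an error like the one above, the NVIDIA kernel driver usually also writes an "Xid" line to the kernel log, which can help narrow down the cause. Below is a minimal Python sketch for pulling those lines out of dmesg output. The exact line layout matched by the regex is an assumption based on how the driver commonly logs Xid events, and the function names are hypothetical, not part of any NVIDIA tool.

```python
import re
import subprocess

# Assumed layout of an Xid line in the kernel log:
#   "NVRM: Xid (PCI:<bus id>): <code>, ..."
# Some driver versions omit the "PCI:" prefix, so it is optional here.
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_bus_id, xid_code) pairs for every Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


def read_kernel_log():
    """Return the kernel log as text, or "" if dmesg is missing or fails.

    Reading dmesg may require root on some systems.
    """
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout if result.returncode == 0 else ""


# Usage (run right after the failure, before rebooting):
#   for bus_id, code in find_xid_events(read_kernel_log()):
#       print(f"GPU {bus_id}: Xid {code}")
```

The Xid code and bus ID printed this way are worth including in the post next to the bug report, since the bus ID shows which card the driver complained about.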

I’ve run the training many times since this afternoon, and every run has failed. At first the model would train for over an hour before failing. As of this evening, runs fail within a few minutes, and a fan spins up very loudly at the moment of failure.

Some of my own experiments:

  1. Rebooted; it didn’t help.
  2. Reinstalled the graphics driver; no help.
  3. Tonight I tested each card individually:
    3.1 Training with a single card in the PCIe1 slot quickly reports an error (1 second before the error, the GPU core temperature is only 40-50 °C).
    3.2 Training with a single card in the PCIe3 slot works; I can train the full model normally.
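
The per-card tests above could be backed up with a log of temperature, power draw, and fan speed for each GPU, so the last readings before the failure are on record. This is a minimal Python sketch, assuming nvidia-smi is on the PATH and accepts the `--query-gpu` fields listed in `FIELDS`. The function names are hypothetical.

```python
import subprocess

# nvidia-smi --query-gpu properties to sample for each card.
FIELDS = ["index", "pci.bus_id", "temperature.gpu", "power.draw", "fan.speed"]


def _to_number(value):
    """Convert a CSV cell to float, or None for values such as "[N/A]"."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU.

    Lines that do not have exactly len(FIELDS) cells (for example an
    error message printed for a failed GPU) are skipped.
    """
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(FIELDS):
            continue
        rows.append({
            "index": parts[0],
            "bus_id": parts[1],
            "temp_c": _to_number(parts[2]),
            "power_w": _to_number(parts[3]),
            "fan_pct": _to_number(parts[4]),
        })
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed rows ([] on failure)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return []
    return parse_gpu_csv(result.stdout)


# Usage: call sample_gpus() about once per second in a separate terminal
# while training runs, and append each result to a file with a timestamp.
```

If the card in the PCIe1 slot disappears from the output while its last readings still look normal, that would be consistent with the 40-50 °C observation in 3.1 and would point away from simple overheating, though it would not prove the cause.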

Here is my nvidia-bug-report.log:
nvidia-bug-report.log.gz (643.1 KB)

Could someone give me a hand? Thank you very much!