Random Xid 61 and GPU disappears (RTX 2080 Ti, 440.64 driver, Ubuntu 20.04)

We regularly receive this error:

/var/log/kern.log:
Jun 15 16:45:28 hal2 kernel: [    1.752052] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Jun 15 16:45:28 hal2 kernel: [    1.801781] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.64  Fri Feb 21 01:17:26 UTC 2020
Jun 15 16:45:28 hal2 kernel: [    1.811054] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  440.64  Fri Feb 21 00:43:19 UTC 2020
Jun 15 16:45:28 hal2 kernel: [    1.812681] [drm] [nvidia-drm] [GPU ID 0x00006800] Loading driver
Jun 15 16:45:28 hal2 kernel: [    1.813031] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:68:00.0 on minor 0
Jun 15 16:45:28 hal2 kernel: [    5.157959] nvidia-uvm: Loaded the UVM driver, major device number 511.
Jun 15 16:48:42 hal2 kernel: [  208.455485] NVRM: GPU at PCI:0000:68:00: GPU-2a9d6094-368e-e43d-6843-cea9465affed
Jun 15 16:48:42 hal2 kernel: [  208.455486] NVRM: GPU Board Serial Number: 
Jun 15 16:48:42 hal2 kernel: [  208.455487] NVRM: Xid (PCI:0000:68:00): 61, pid=2324, 0cec(3098) 00000000 00000000
Jun 15 16:48:59 hal2 kernel: [  224.736826] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x24:0x65:1185)
Jun 15 16:48:59 hal2 kernel: [  224.736854] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 0

The error appears after a few minutes of running on the following system:

Intel i9-9940X
WS X299 SAGE/10G (BIOS rev. 2002)
Asus GeForce RTX 2080 Ti 11GB Turbo Edition
Corsair Vengeance LPX 128GB DDR4 2666
HP EX950 M.2 2TB SSD
CORSAIR AX1600i PSU

Ubuntu 20.04
Linux 5.4.0-37-generic
Drivers nvidia-driver-440 (440.82+really.440.64-0ubuntu6)
https://packages.ubuntu.com/focal/nvidia-driver-440

Secure Boot is disabled, and we have already tried the pcie_port_pm=off kernel parameter.
I have attached bug reports taken before and after the crash (hal2_before… and hal2_after…).
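
For anyone who wants to double-check that pcie_port_pm=off was actually picked up at boot, here is a minimal Python sketch. It assumes the running kernel's command line is readable from /proc/cmdline (standard on Linux). The helper names `has_kernel_param` and `read_cmdline` are made up for this post, and the sample command line in the comment is hypothetical, not taken from hal2.

```python
def has_kernel_param(cmdline, param):
    """Return True if `param` appears as a whole token in a kernel command line."""
    # Kernel parameters are whitespace-separated, so an exact token match
    # avoids false positives such as "pcie_port_pm=offline".
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as fh:
        return fh.read().strip()


# Hypothetical example command line (not from the machine in this thread):
sample = "BOOT_IMAGE=/vmlinuz-5.4.0-37-generic ro quiet pcie_port_pm=off"
print(has_kernel_param(sample, "pcie_port_pm=off"))
```

On the real machine you would call `has_kernel_param(read_cmdline(), "pcie_port_pm=off")` instead of using the sample string.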

We have built and configured an identical box (hal1) where the problem does not show up (I have attached the report for that box as well). We swapped the GPUs, PSUs, and PSU cables between the two boxes, but the problem persists, and only on hal2.

Occasionally, the error occurs after a longer period of time (about 15 minutes while watching the output of nvidia-smi, or 10 minutes of training with TensorFlow). After this error, some or all of the GPUs are not found by nvidia-smi.
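
To compare how often and when the Xid shows up on hal2 versus hal1, the kernel log can be scanned for the NVRM Xid lines. Below is a rough Python sketch. The regular expression is based only on the format of the Xid line in the kern.log excerpt in this post, so other driver versions may print it differently, and `find_xid_events` is a name invented here for illustration.

```python
import re

# Matches lines shaped like the one in the excerpt above:
# "NVRM: Xid (PCI:0000:68:00): 61, pid=2324, 0cec(3098) 00000000 00000000"
# The pid part is treated as optional in case it is missing.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)(?:, pid=(\d+))?")


def find_xid_events(lines):
    """Return a list of (pci_address, xid_code, pid) tuples; pid is None if absent."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            pid = int(match.group(3)) if match.group(3) else None
            events.append((match.group(1), int(match.group(2)), pid))
    return events


# The Xid line copied from the log excerpt in this post.
sample_log = [
    "Jun 15 16:48:42 hal2 kernel: [  208.455487] NVRM: Xid (PCI:0000:68:00): 61, "
    "pid=2324, 0cec(3098) 00000000 00000000",
]
for pci, xid, pid in find_xid_events(sample_log):
    print(pci, xid, pid)
```

To run it against a real log, pass an open file instead of the sample list, for example `find_xid_events(open("/var/log/kern.log", errors="replace"))`.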

hal2_before_prob_nvidia-bug-report.txt (998.5 KB) hal2_after_prob_nvidia-bug-report.txt (1.3 MB) hal1_nvidia-bug-report.txt (2.6 MB)

Please see this:
https://forums.developer.nvidia.com/t/random-xid-61-and-xorg-lock-up/79731/185?u=generix