We regularly receive this error:
/var/log/kern.log:
Jun 15 16:45:28 hal2 kernel: [ 1.752052] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Jun 15 16:45:28 hal2 kernel: [ 1.801781] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.64 Fri Feb 21 01:17:26 UTC 2020
Jun 15 16:45:28 hal2 kernel: [ 1.811054] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.64 Fri Feb 21 00:43:19 UTC 2020
Jun 15 16:45:28 hal2 kernel: [ 1.812681] [drm] [nvidia-drm] [GPU ID 0x00006800] Loading driver
Jun 15 16:45:28 hal2 kernel: [ 1.813031] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:68:00.0 on minor 0
Jun 15 16:45:28 hal2 kernel: [ 5.157959] nvidia-uvm: Loaded the UVM driver, major device number 511.
Jun 15 16:48:42 hal2 kernel: [ 208.455485] NVRM: GPU at PCI:0000:68:00: GPU-2a9d6094-368e-e43d-6843-cea9465affed
Jun 15 16:48:42 hal2 kernel: [ 208.455486] NVRM: GPU Board Serial Number:
Jun 15 16:48:42 hal2 kernel: [ 208.455487] NVRM: Xid (PCI:0000:68:00): 61, pid=2324, 0cec(3098) 00000000 00000000
Jun 15 16:48:59 hal2 kernel: [ 224.736826] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x24:0x65:1185)
Jun 15 16:48:59 hal2 kernel: [ 224.736854] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 0
after a few minutes of execution on a box with the following configuration:
Intel i9-9940X
ASUS WS X299 SAGE/10G (BIOS rev. 2002)
Asus GeForce RTX 2080 Ti 11GB Turbo Edition
Corsair Vengeance LPX 128GB DDR4 2666
HP EX950 M.2 2TB SSD
CORSAIR AX1600i PSU
Ubuntu 20.04
Linux 5.4.0-37-generic
Driver: nvidia-driver-440 (440.82+really.440.64-0ubuntu6) https://packages.ubuntu.com/focal/nvidia-driver-440
Secure Boot is disabled, and we have already tried the pcie_port_pm=off kernel parameter.
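For reference, we added the parameter the standard way through GRUB, roughly as below (this is a sketch assuming the default Ubuntu /etc/default/grub layout; the "quiet splash" defaults may differ on your system):

# /etc/default/grub (excerpt)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_port_pm=off"

# regenerate the GRUB configuration and reboot
sudo update-grub
sudo reboot

# after reboot, confirm the parameter is active
cat /proc/cmdline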
I am attaching reports taken before and after the crash (hal2_before… and hal2_after…).
We have built and configured an identical box (hal1) on which the problem does not show up (the report for that box is attached as well). We tried swapping the GPUs, the PSU, and the PSU cables between the two boxes, but the problem persists, and only on hal2.
Occasionally the error takes longer to appear (about 15 minutes while watching the output of nvidia-smi, or about 10 minutes of training with TensorFlow). After the error, some or all of the GPUs are no longer visible to nvidia-smi.
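In case it helps with reproduction, this is roughly how we watch the GPU while waiting for the error (a simple polling sketch; the log file paths are just examples):

# poll the GPU every 5 seconds and append to a CSV until the Xid shows up
nvidia-smi --query-gpu=timestamp,pci.bus_id,temperature.gpu,power.draw,utilization.gpu \
           --format=csv -l 5 >> /tmp/hal2_gpu_poll.csv &

# follow the kernel log in parallel to catch the Xid / RmInitAdapter messages
sudo dmesg -wT | grep -i --line-buffered 'NVRM\|Xid' >> /tmp/hal2_kern_watch.log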