CPU stuck when accessing the GPU

Hi

We are running an HPC cluster with several types of GPUs and CPUs, all using the same Linux kernel and Nvidia modules.
We had problems with two of five machines that have the same hardware configuration. We contacted support, who updated the firmware and replaced hardware, but on one machine the problem persists, even though it does not occur when we boot that machine with another Linux installation.
So we want to check whether the problem lies in our self-compiled kernel or our custom Nvidia driver installation, and we would appreciate your help.

The machines are GIGABYTE G292-Z45-00 (board MZ42-G21-00).

The GPUs have not caused any problems on Ubuntu 20.04.4 with kernel 5.13.0-39 and Nvidia driver 470.103.01.
The problems occur on Debian 10 with our self-compiled kernel 5.10.104 and Nvidia driver 470.86.
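
To make the comparison between the two installations easier to repeat, the kernel and driver versions of the running system could be captured with a small script like the one below. This is only a sketch: the helper names `nvidia_driver_version` and `describe_boot` are made up for illustration, and it assumes the proprietary driver exposes an `NVRM version:` line in `/proc/driver/nvidia/version` while the module is loaded.

```python
import platform
import re
from pathlib import Path


def nvidia_driver_version(path="/proc/driver/nvidia/version"):
    """Return the loaded Nvidia kernel module version, or None if unavailable.

    Assumption: the file contains a line starting with "NVRM version:" that
    includes a dotted version number such as 470.86 or 470.103.01.
    """
    try:
        text = Path(path).read_text()
    except OSError:
        # Module not loaded, or the path does not exist on this system.
        return None
    match = re.search(
        r"^NVRM version:.*?\s(\d+\.\d+(?:\.\d+)?)\s", text, re.MULTILINE
    )
    return match.group(1) if match else None


def describe_boot():
    """Collect the running kernel release and the Nvidia module version."""
    return {"kernel": platform.release(), "nvidia": nvidia_driver_version()}


if __name__ == "__main__":
    print(describe_boot())
```

Running it once under each installation on the affected machine would give two records that can be compared side by side.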

nvidia-bug-report.log.gz (6.1 KB)

[Mon Apr  4 15:47:47 2022] BUG: kernel NULL pointer dereference, address: 00000000000000b1
[Mon Apr  4 15:47:47 2022] #PF: supervisor read access in kernel mode
[Mon Apr  4 15:47:47 2022] #PF: error_code(0x0000) - not-present page
[Mon Apr  4 15:47:47 2022] PGD 12a8ad067 P4D 12a8ad067 PUD 12a8ac067 PMD 0
[Mon Apr  4 15:47:47 2022] Oops: 0000 [#1] SMP NOPTI
[Mon Apr  4 15:47:47 2022] CPU: 6 PID: 11162 Comm: nvidia-smi Tainted: P           O      5.10.104-aufs-3 #1
[Mon Apr  4 15:47:47 2022] Hardware name: GIGABYTE G292-Z45-00/MZ42-G21-00, BIOS M09 03/15/2022
[Mon Apr  4 15:47:47 2022] RIP: 0010:_nv031719rm+0x79/0x940 [nvidia]
[Mon Apr  4 15:47:47 2022] Code: 07 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89 7d 10 66 0f 1f 44 00 00 41 f6 c5 01 0f 84 90 00 00 00 49 8b 86 30 1a 00 00 <80> b8 b1 00 00 00 00 74 12 b8 01 00 00 00 89 d9 d3 e0 41 85 86 94
[Mon Apr  4 15:47:47 2022] RSP: 0018:ffffb6ef0e1cf980 EFLAGS: 00010202
[Mon Apr  4 15:47:47 2022] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003
[Mon Apr  4 15:47:47 2022] RDX: ffff9d4ad2668008 RSI: ffff9d4aed3b6008 RDI: ffff9d4aeab44008
[Mon Apr  4 15:47:47 2022] RBP: ffff9d4aed9d3ac0 R08: ffff9d4aed9d3b93 R09: ffff9d4aed9d3ba4
[Mon Apr  4 15:47:47 2022] R10: ffffffffc0f13790 R11: 0000000000000000 R12: ffff9d4aed9d3b08
[Mon Apr  4 15:47:47 2022] R13: 000000000000000f R14: ffff9d4aed3b6008 R15: 0000000000000001
[Mon Apr  4 15:47:47 2022] FS:  00007f61db889b80(0000) GS:ffff9d892ed80000(0000) knlGS:0000000000000000
[Mon Apr  4 15:47:47 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Apr  4 15:47:47 2022] CR2: 00000000000000b1 CR3: 000000010e16e001 CR4: 0000000000770ee0
[Mon Apr  4 15:47:47 2022] PKRU: 55555554
[Mon Apr  4 15:47:47 2022] Call Trace:
[Mon Apr  4 15:47:47 2022]  ? _nv031833rm+0x82/0x270 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? _nv031866rm+0x17/0x30 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? _nv022832rm+0xc0/0x1b0 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? _nv022837rm+0x11b/0x230 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? _nv022837rm+0x211/0x230 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? _nv022839rm+0x310/0x310 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? _nv023513rm+0x32d/0x470 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? _nv023513rm+0x304/0x470 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? _nv000721rm+0x32a/0x680 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? _nv000714rm+0x17b2/0x2370 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? rm_init_adapter+0xc5/0xe0 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? nv_open_device+0x11b/0x890 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? nvidia_open+0x1c6/0x4c0 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? nvidia_frontend_open+0x4e/0x90 [nvidia]
[Mon Apr  4 15:47:47 2022]  ? chrdev_open+0x99/0x1a0
[Mon Apr  4 15:47:47 2022]  ? cdev_put.part.3+0x20/0x20
[Mon Apr  4 15:47:47 2022]  ? do_dentry_open+0x144/0x360
[Mon Apr  4 15:47:47 2022]  ? path_openat+0xb3b/0xfc0
[Mon Apr  4 15:47:47 2022]  ? do_filp_open+0x8e/0x100
[Mon Apr  4 15:47:47 2022]  ? do_sys_openat2+0x223/0x2d0
[Mon Apr  4 15:47:47 2022]  ? do_sys_open+0x46/0x80
[Mon Apr  4 15:47:47 2022]  ? do_syscall_64+0x33/0x80
[Mon Apr  4 15:47:47 2022]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Mon Apr  4 15:47:47 2022] Modules linked in: cpufreq_conservative cpufreq_powersave cpufreq_ondemand binfmt_misc cpufreq_userspace quota_v2 quota_tree nvidia_modeset(PO) nvidia_uvm(PO) nvidia(PO) drm ipmi_ssif sg amd64_edac_mod kvm_amd snd_hda_codec_hdmi kvm snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core uas irqbypass snd_pcm acpi_ipmi snd_timer usb_storage crc32_pclmul wmi_bmof snd ipmi_si rapl evdev pcspkr joydev soundcore k10temp wmi ipmi_devintf ipmi_msghandler tiny_power_button acpi_cpufreq button i2c_dev parport_pc lp parport rpcsec_gss_krb5 ip_tables x_tables autofs4
[Mon Apr  4 15:47:47 2022] CR2: 00000000000000b1
[Mon Apr  4 15:47:47 2022] ---[ end trace 88b328c764bd13e1 ]---
[Mon Apr  4 15:47:47 2022] RIP: 0010:_nv031719rm+0x79/0x940 [nvidia]
[Mon Apr  4 15:47:47 2022] Code: 07 00 00 41 bf 01 00 00 00 4c 8d 65 48 31 db 44 89 7d 10 66 0f 1f 44 00 00 41 f6 c5 01 0f 84 90 00 00 00 49 8b 86 30 1a 00 00 <80> b8 b1 00 00 00 00 74 12 b8 01 00 00 00 89 d9 d3 e0 41 85 86 94
[Mon Apr  4 15:47:47 2022] RSP: 0018:ffffb6ef0e1cf980 EFLAGS: 00010202
[Mon Apr  4 15:47:47 2022] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003
[Mon Apr  4 15:47:47 2022] RDX: ffff9d4ad2668008 RSI: ffff9d4aed3b6008 RDI: ffff9d4aeab44008
[Mon Apr  4 15:47:47 2022] RBP: ffff9d4aed9d3ac0 R08: ffff9d4aed9d3b93 R09: ffff9d4aed9d3ba4
[Mon Apr  4 15:47:47 2022] R10: ffffffffc0f13790 R11: 0000000000000000 R12: ffff9d4aed9d3b08
[Mon Apr  4 15:47:47 2022] R13: 000000000000000f R14: ffff9d4aed3b6008 R15: 0000000000000001
[Mon Apr  4 15:47:47 2022] FS:  00007f61db889b80(0000) GS:ffff9d892ed80000(0000) knlGS:0000000000000000
[Mon Apr  4 15:47:47 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Apr  4 15:47:47 2022] CR2: 00000000000000b1 CR3: 000000010e16e001 CR4: 0000000000770ee0
[Mon Apr  4 15:47:47 2022] PKRU: 55555554
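
In case it helps with triage, here is a minimal sketch that pulls the key fields out of the trace above and checks one reading of it. The bytes at the marked position in the `Code:` line, `<80> b8 b1 00 00 00 00`, decode as `cmpb $0x0,0xb1(%rax)`, and the instruction just before them, `49 8b 86 30 1a 00 00`, as `mov 0x1a30(%r14),%rax`. If that reading is right, the driver loaded a NULL pointer from `R14 + 0x1a30` and then read a byte at offset `0xb1` from it, so `CR2` should equal `RAX + 0xb1`. The function name `parse_oops` and the shortened `OOPS` excerpt (timestamps dropped) are just for illustration.

```python
import re

# Excerpt of the oops above, with the timestamp prefix removed.
OOPS = """\
BUG: kernel NULL pointer dereference, address: 00000000000000b1
RIP: 0010:_nv031719rm+0x79/0x940 [nvidia]
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003
CR2: 00000000000000b1 CR3: 000000010e16e001 CR4: 0000000000770ee0
"""


def parse_oops(text):
    """Extract the faulting symbol, offset, module and 64-bit register values."""
    rip = re.search(
        r"RIP: [0-9a-f]{4}:(\S+)\+0x([0-9a-f]+)/0x[0-9a-f]+(?: \[(\w+)\])?",
        text,
    )
    # Matches "NAME: <16 hex digits>" pairs such as RAX, R08 or CR2.
    regs = {
        name: int(value, 16)
        for name, value in re.findall(
            r"\b([A-Z][A-Z0-9]{1,3}): ([0-9a-f]{16})\b", text
        )
    }
    return {
        "symbol": rip.group(1),
        "offset": int(rip.group(2), 16),
        "module": rip.group(3),
        "regs": regs,
    }


info = parse_oops(OOPS)
# Displacement taken from the decoded instruction cmpb $0x0,0xb1(%rax).
DISPLACEMENT = 0xB1
null_base = info["regs"]["RAX"] == 0
address_matches = info["regs"]["RAX"] + DISPLACEMENT == info["regs"]["CR2"]
print(info["module"], info["symbol"], hex(info["offset"]), null_base, address_matches)
```

Both checks hold for the values in this trace, which is consistent with a NULL structure pointer inside the `nvidia` module during `rm_init_adapter`, rather than a fault in generic kernel code.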