CentOS 7.9, NVIDIA driver 470.161.03: random kernel crash

Hello everyone. I use GPUs for AI workloads, but the kernel crashes every day. I have already updated to the latest kernel and upgraded the NVIDIA driver from 465.19.01 to 470.161.03. What else can I do?
Here is the content of vmcore-dmesg.txt:
[235992.629556] NVRM: GPU at PCI:0000:d5:00: GPU-6d04fd5f-9453-afde-2202-86694ec2b11a
[235992.629565] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.631521] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.633332] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.635092] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.636867] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.638533] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.640212] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.641939] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.643674] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.645349] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.647122] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.648856] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.650659] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.652385] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.654115] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.655860] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.657559] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.659247] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.660978] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, 0000(0000) 00000000 00000000
[235992.661288] BUG: unable to handle kernel paging request at fffffffd220b46c0
[235992.661362] IP: [] cpuacct_charge+0x2d/0x50
[235992.661427] PGD 5240014067 PUD 0
[235992.661463] Thread overran stack, or stack corrupted
[235992.661508] Oops: 0000 [#1] SMP
[235992.661544] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc dm_mirror dm_region_hash dm_log dm_mod nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(OE) nvidia(POE) vfat fat iTCO_wdt iTCO_vendor_support ext4 mbcache jbd2 snd_hda_codec_hdmi i10nm_edac nfit libnvdimm coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul snd_hda_intel glue_helper ablk_helper cryptd snd_hda_codec pcspkr snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore joydev sg mei_me mei i2c_i801 ipmi_si ipmi_devintf ipmi_msghandler pinctrl_lewisburg pinctrl_intel acpi_cpufreq acpi_power_meter ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ast drm_kms_helper syscopyarea sysfillrect sysimgblt
[235992.662272] fb_sys_fops ttm drm ahci crct10dif_pclmul crct10dif_common crc32c_intel igb libahci libata rndis_host cdc_ether ptp usbnet pps_core dca mii drm_panel_orientation_quirks i2c_algo_bit wmi
[235992.662452] CPU: 63 PID: 43667 Comm: irq/182-nvidia Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.83.1.el7.x86_64 #1
[235992.667116] Hardware name: Supermicro SYS-420GP-TNR/X12DPG-OA6, BIOS 1.20 04/17/2022
[235992.669303] task: ffffa1ca38e36300 ti: ffffa1ca15fcc000 task.ti: ffffa1ca15fcc000
[235992.671488] RIP: 0010:[] [] cpuacct_charge+0x2d/0x50
[235992.674040] RSP: 0018:ffffa1ca3f5c3e18 EFLAGS: 00010046
[235992.676598] RAX: 0000000000015c60 RBX: ffffa1ca39b37980 RCX: ffffffffae5ec5b8
[235992.679159] RDX: ffffffffaf05d220 RSI: 00000000000f44aa RDI: ffffa1ca38e36300
[235992.681305] RBP: ffffa1ca3f5c3e18 R08: 000000000000039d R09: 000000000000040f
[235992.683453] R10: 0000000000000004 R11: 0000000000000005 R12: ffffa1ca38e36300
[235992.685592] R13: 000000000000003f R14: 00000000000f44aa R15: ffffa1ca3f5dacc0
[235992.687711] FS: 0000000000000000(0000) GS:ffffa1ca3f5c0000(0000) knlGS:0000000000000000
[235992.689849] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[235992.691992] CR2: fffffffd220b46c0 CR3: 0000007f78f42000 CR4: 0000000000760fe0
[235992.694147] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[235992.696292] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[235992.698433] PKRU: 00000000
[235992.700545] Call Trace:
[235992.702635]
[235992.702659] [] update_curr_rt+0xdb/0x2c0
[235992.706753] [] task_tick_rt+0x14/0x130
[235992.709194] [] scheduler_tick+0xd4/0x160
[235992.711257] [] ? tick_sched_do_timer+0x60/0x60
[235992.713293] [] update_process_times+0x65/0x80
[235992.715314] [] tick_sched_handle+0x30/0x80
[235992.717328] [] tick_sched_timer+0x39/0x80
[235992.719316] [] __hrtimer_run_queues+0x10e/0x270
[235992.721284] [] hrtimer_interrupt+0xaf/0x1e0
[235992.723247] [] local_apic_timer_interrupt+0x3b/0x70
[235992.725233] [] smp_apic_timer_interrupt+0x43/0x60
[235992.727204] [] apic_timer_interrupt+0x172/0x180
[235992.729122]
[235992.729145] [] ? __vmalloc_node_range+0xc8/0x280
[235992.733349] [] ? _nv033090rm+0x37/0x70 [nvidia]
[235992.736211] [] ? _nv011256rm+0x19e/0x1d0 [nvidia]
[235992.739018] [] ? _nv035926rm+0x6a/0x90 [nvidia]
[235992.741805] [] ? _nv021156rm+0x19/0x30 [nvidia]
[235992.744568] [] ? _nv021292rm+0x2b/0xa0 [nvidia]
[235992.747306] [] ? _nv021260rm+0x20/0x60 [nvidia]
[235992.749734] [] ? _nv009645rm+0xcf/0x420 [nvidia]
[235992.752325] [] ? _nv009650rm+0x7a/0x120 [nvidia]
[235992.754932] [] ? _nv009828rm+0x302/0x490 [nvidia]
[235992.757471] [] ? _nv031585rm+0x44/0x220 [nvidia]
[235992.759948] [] ? _nv031579rm+0xae/0x1a0 [nvidia]
[235992.762344] [] ? _nv031581rm+0x119/0x270 [nvidia]
[235992.764064] [] ? _nv035845rm+0x5a/0x100 [nvidia]
[235992.766003] [] ? _nv021370rm+0x32/0x90 [nvidia]
[235992.767915] [] ? _nv035154rm+0x69/0x110 [nvidia]
[235992.769775] [] ? _nv035158rm+0x121/0x340 [nvidia]
[235992.771636] [] ? _nv029415rm+0x707/0xd90 [nvidia]

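The GPU might be overheating, or it may simply be faulty. NVIDIA's Xid documentation lists Xid 62 as an internal micro-controller halt, with a hardware error, a driver error, or a thermal issue as possible causes.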
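To narrow this down, it may help to see which GPU reports the Xid events and how hot the cards run under load. The following is only a rough sketch, not an official diagnostic: the function names are made up for illustration, and it assumes `nvidia-smi` is on the PATH and supports the `pci.bus_id` and `temperature.gpu` query fields.

```python
import re
import subprocess
from collections import Counter

# Matches lines such as:
# [235992.629565] NVRM: Xid (PCI:0000:d5:00): 62, pid=43667, ...
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def count_xid_events(dmesg_text):
    """Return a Counter mapping (pci_address, xid_code) to occurrences."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def parse_temperatures(csv_text):
    """Parse 'bus_id, temperature' rows into a {bus_id: degrees_C} dict.

    A temperature that is not a plain integer (for example an N/A
    placeholder) is stored as None.
    """
    temps = {}
    for row in csv_text.strip().splitlines():
        bus_id, temp = (field.strip() for field in row.split(",", 1))
        try:
            temps[bus_id] = int(temp)
        except ValueError:
            temps[bus_id] = None
    return temps


def read_gpu_temperatures():
    """Ask nvidia-smi for the current temperature of every GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=pci.bus_id,temperature.gpu",
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (Python 3.6 compatible)
        check=True,
    )
    return parse_temperatures(result.stdout)
```

Calling `count_xid_events(open("vmcore-dmesg.txt").read())` after each crash shows whether the same PCI address (`PCI:0000:d5:00` in the log above) is involved every time. Logging `read_gpu_temperatures()` periodically while the job runs shows whether that card runs hotter than the others. If the same card keeps failing at normal temperatures, a hardware fault becomes the more likely explanation.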