Linux always crashes when running a Python program on the GPU

## Description

When several AI programs run continuously, the server can freeze when video file operations are involved. This usually happens in a highly concurrent development environment.

I found the following error in /var/opt/syslog:


2024-05-15T10:09:25.118508+08:00 trana6b-Default-string kernel: [ 826.919953] Call Trace:
2024-05-15T10:09:25.118508+08:00 trana6b-Default-string kernel: [ 826.919953]
2024-05-15T10:09:25.118509+08:00 trana6b-Default-string kernel: [ 826.919954] dump_stack_lvl+0x48/0x70
2024-05-15T10:09:25.118509+08:00 trana6b-Default-string kernel: [ 826.919956] dump_stack+0x10/0x20
2024-05-15T10:09:25.118509+08:00 trana6b-Default-string kernel: [ 826.919957] __ubsan_handle_out_of_bounds+0xc6/0x110
2024-05-15T10:09:25.118509+08:00 trana6b-Default-string kernel: [ 826.919959] merge_gpu_chunk+0xc6/0x1d0 [nvidia_uvm]
2024-05-15T10:09:25.118510+08:00 trana6b-Default-string kernel: [ 826.919987] free_chunk_with_merges+0x13d/0x180 [nvidia_uvm]
2024-05-15T10:09:25.118510+08:00 trana6b-Default-string kernel: [ 826.920013] free_chunk+0xa4/0xd0 [nvidia_uvm]
2024-05-15T10:09:25.118510+08:00 trana6b-Default-string kernel: [ 826.920039] uvm_pmm_gpu_free+0xbf/0xf0 [nvidia_uvm]
2024-05-15T10:09:25.118510+08:00 trana6b-Default-string kernel: [ 826.920064] phys_mem_deallocate+0x33/0xd0 [nvidia_uvm]
2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [ 826.920093] uvm_page_tree_put_ptes_async+0x4d5/0x580 [nvidia_uvm]
2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [ 826.920123] uvm_page_table_range_vec_deinit+0x3e/0xd0 [nvidia_uvm]
2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [ 826.920151] uvm_ext_gpu_map_destroy+0xd7/0x1f0 [nvidia_uvm]
2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [ 826.920176] uvm_va_range_destroy+0x324/0x590 [nvidia_uvm]
2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [ 826.920203] ? _nv025923rm+0x2b/0xf0 [nvidia]
2024-05-15T10:09:25.118512+08:00 trana6b-Default-string kernel: [ 826.920401] ? _nv043203rm+0xe9/0x1c0 [nvidia]
2024-05-15T10:09:25.118512+08:00 trana6b-Default-string kernel: [ 826.920648] uvm_api_free+0x188/0x320 [nvidia_uvm]
2024-05-15T10:09:25.118512+08:00 trana6b-Default-string kernel: [ 826.920667] uvm_ioctl+0xf6e/0x1cd0 [nvidia_uvm]
2024-05-15T10:09:25.118512+08:00 trana6b-Default-string kernel: [ 826.920683] ? _raw_spin_lock_irqsave+0xe/0x20
2024-05-15T10:09:25.118513+08:00 trana6b-Default-string kernel: [ 826.920684] ? os_acquire_spinlock+0x12/0x30 [nvidia]
2024-05-15T10:09:25.118513+08:00 trana6b-Default-string kernel: [ 826.920828] ? os_release_spinlock+0x1a/0x30 [nvidia]
2024-05-15T10:09:25.118513+08:00 trana6b-Default-string kernel: [ 826.920970] ? _nv047682rm+0xed/0x1d0 [nvidia]
2024-05-15T10:09:25.118513+08:00 trana6b-Default-string kernel: [ 826.921113] ? _nv043407rm+0x77/0xd0 [nvidia]
2024-05-15T10:09:25.118513+08:00 trana6b-Default-string kernel: [ 826.921263] ? _nv011756rm+0x86/0xa0 [nvidia]
2024-05-15T10:09:25.118514+08:00 trana6b-Default-string kernel: [ 826.921413] ? _raw_spin_lock_irqsave+0xe/0x20
2024-05-15T10:09:25.118514+08:00 trana6b-Default-string kernel: [ 826.921414] ? _raw_spin_lock_irqsave+0xe/0x20
2024-05-15T10:09:25.118514+08:00 trana6b-Default-string kernel: [ 826.921415] ? thread_context_non_interrupt_add+0x13a/0x2c0 [nvidia_uvm]
2024-05-15T10:09:25.118514+08:00 trana6b-Default-string kernel: [ 826.921439] uvm_unlocked_ioctl_entry.part.0+0x7b/0xf0 [nvidia_uvm]
2024-05-15T10:09:25.118515+08:00 trana6b-Default-string kernel: [ 826.921455] ? nvidia_ioctl+0x369/0x8a0 [nvidia]
2024-05-15T10:09:25.118515+08:00 trana6b-Default-string kernel: [ 826.921595] ? kfree+0x78/0x120
2024-05-15T10:09:25.118515+08:00 trana6b-Default-string kernel: [ 826.921596] ? nvidia_ioctl+0x369/0x8a0 [nvidia]
2024-05-15T10:09:25.118515+08:00 trana6b-Default-string kernel: [ 826.921736] uvm_unlocked_ioctl_entry+0x6b/0x90 [nvidia_uvm]
2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [ 826.921752] __x64_sys_ioctl+0xa0/0xf0
2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [ 826.921753] do_syscall_64+0x59/0x90
2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [ 826.921754] ? syscall_exit_to_user_mode+0x37/0x60
2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [ 826.921756] ? do_syscall_64+0x68/0x90
2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [ 826.921757] ? rcu_core_si+0xe/0x20
2024-05-15T10:09:25.118517+08:00 trana6b-Default-string kernel: [ 826.921757] ? __do_softirq+0xd6/0x346
2024-05-15T10:09:25.118517+08:00 trana6b-Default-string kernel: [ 826.921759] ? hrtimer_interrupt+0x11f/0x250
2024-05-15T10:09:25.118517+08:00 trana6b-Default-string kernel: [ 826.921759] ? exit_to_user_mode_prepare+0x30/0xb0
2024-05-15T10:09:25.118517+08:00 trana6b-Default-string kernel: [ 826.921761] ? irqentry_exit_to_user_mode+0x17/0x20
2024-05-15T10:09:25.118519+08:00 trana6b-Default-string kernel: [ 826.921762] ? irqentry_exit+0x43/0x50
2024-05-15T10:09:25.118520+08:00 trana6b-Default-string kernel: [ 826.921763] ? sysvec_apic_timer_interrupt+0x4b/0xd0
2024-05-15T10:09:25.118520+08:00 trana6b-Default-string kernel: [ 826.921764] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
2024-05-15T10:09:25.118520+08:00 trana6b-Default-string kernel: [ 826.921765] RIP: 0033:0x7fd25c72396f
2024-05-15T10:09:25.118520+08:00 trana6b-Default-string kernel: [ 826.921769] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
2024-05-15T10:09:25.118521+08:00 trana6b-Default-string kernel: [ 826.921770] RSP: 002b:00007fd2111db6d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
2024-05-15T10:09:25.118521+08:00 trana6b-Default-string kernel: [ 826.921770] RAX: ffffffffffffffda RBX: 00007fd048ef01c0 RCX: 00007fd25c72396f
2024-05-15T10:09:25.118521+08:00 trana6b-Default-string kernel: [ 826.921771] RDX: 00007fd2111db730 RSI: 0000000000000022 RDI: 0000000000000005
2024-05-15T10:09:25.118530+08:00 trana6b-Default-string kernel: [ 826.921771] RBP: 00007fd2111db780 R08: 0000000000000000 R09: 0000000000000000
2024-05-15T10:09:25.118530+08:00 trana6b-Default-string kernel: [ 826.921772] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
2024-05-15T10:09:25.118530+08:00 trana6b-Default-string kernel: [ 826.921772] R13: 00007fd048ef01c0 R14: 00007fd2111db730 R15: 0000000000000005
2024-05-15T10:09:25.118530+08:00 trana6b-Default-string kernel: [ 826.921773]
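
To make a trace like the one above easier to read, here is a minimal sketch that pulls the function and module names out of each stack frame. It assumes the syslog line layout shown in this post. The names `Frame`, `parse_frames` and `first_reliable_frame` are made up for illustration and are not part of any NVIDIA or kernel tool. Frames that the kernel prints with a leading `?` are flagged as unreliable, since the unwinder is only guessing at them.

```python
import re
from typing import NamedTuple


class Frame(NamedTuple):
    func: str
    module: str      # "kernel" when the line has no [module] suffix
    reliable: bool   # False for frames printed with a leading "?"


# Matches the tail of a stack-frame line, e.g.
#   "... kernel: [ 826.919959] merge_gpu_chunk+0xc6/0x1d0 [nvidia_uvm]"
FRAME_RE = re.compile(
    r"\]\s+(?P<guess>\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(lines):
    """Return one Frame per line that looks like a stack frame."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append(
                Frame(
                    func=match.group("func"),
                    module=match.group("module") or "kernel",
                    reliable=match.group("guess") is None,
                )
            )
    return frames


def first_reliable_frame(frames, module):
    """Return the topmost reliable frame in the given module, or None."""
    return next((f for f in frames if f.reliable and f.module == module), None)


# A few lines copied from the trace in this post.
SAMPLE = [
    "2024-05-15T10:09:25.118509+08:00 trana6b-Default-string kernel: [ 826.919957] __ubsan_handle_out_of_bounds+0xc6/0x110",
    "2024-05-15T10:09:25.118509+08:00 trana6b-Default-string kernel: [ 826.919959] merge_gpu_chunk+0xc6/0x1d0 [nvidia_uvm]",
    "2024-05-15T10:09:25.118511+08:00 trana6b-Default-string kernel: [ 826.920203] ? _nv025923rm+0x2b/0xf0 [nvidia]",
    "2024-05-15T10:09:25.118516+08:00 trana6b-Default-string kernel: [ 826.921752] __x64_sys_ioctl+0xa0/0xf0",
]

frames = parse_frames(SAMPLE)
print(first_reliable_frame(frames, "nvidia_uvm"))
```

On these sample lines the topmost reliable `nvidia_uvm` frame is `merge_gpu_chunk`, which sits directly under `__ubsan_handle_out_of_bounds`. That suggests the kernel's UBSAN array-bounds check fired inside the `nvidia_uvm` module, not in PyTorch or TensorRT code.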


## Environment

**TensorRT Version**: 
**GPU Type**:  NVIDIA RTX A6000
**Nvidia Driver Version**: 535.154.05
**CUDA Version**: 12.2
**CUDNN Version**: Build cuda_11.8.r11.8/compiler.31833905_0
**Operating System + Version**: Ubuntu 23.01
**Python Version (if applicable)**: multiple versions
**TensorFlow Version (if applicable)**: TensorFlow not used
**PyTorch Version (if applicable)**: multiple versions
**Baremetal or Container (if container which image + tag)**:  no 
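
The driver and CUDA values in this section look like they were copied from the `nvidia-smi` banner. As a rough sketch, the two values can be pulled out of that banner text like this. `parse_smi_banner` is a hypothetical helper name, and the input string is the text from this post, not live tool output.

```python
import re


def parse_smi_banner(text):
    """Extract driver and CUDA versions from nvidia-smi banner text."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return {
        "driver": driver.group(1) if driver else None,
        "cuda": cuda.group(1) if cuda else None,
    }


banner = "Driver Version: 535.154.05   CUDA Version: 12.2"
versions = parse_smi_banner(banner)
print(versions)
```

The CUDA version in the `nvidia-smi` banner is the highest CUDA version the installed driver supports. It can differ from the installed toolkit, which here appears to be 11.8 according to the `cuda_11.8` build string.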


## Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

## Steps To Reproduce

<!-- Craft a minimal bug report following this guide - https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports -->

Please include:
  * Exact steps/commands to build your repro
  * Exact steps/commands to run your repro
  * Full traceback of errors encountered

Hi @17702069165 ,
This forum covers issues related to TensorRT (TRT).
I'm afraid I might not have the correct answer to your query.

Hey, thanks for replying. I didn't use TRT; it is all pure PyTorch with .pth models.

Hi @17702069165 ,
You may reach out through the pytorch/pytorch issue tracker on GitHub.
Thanks

I will check this out, thanks anyway.