Intermittent CUDA error when loading TrtEngine: Error Code 1: Cuda Runtime (an illegal memory access was encountered)

JetPack 5.1.1 on NX

R35 (release), REVISION: 3.1, GCID: 32827747, BOARD: t186ref, EABI: aarch64, DATE: Sun Mar 19 15:19:21 UTC 2023

dmesg output:

[ 1541.196961] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x1, err_id=0xa, ss_err_id = 0x2a1
[ 1541.197374] nvgpu: 17000000.gv11b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(0), offset(0)
[ 1541.197727] nvgpu: 17000000.gv11b nvgpu_gr_intr_handle_sm_exception:390  [ERR]  could not pre-process sm error!
[ 1541.197997] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x1, err_id=0xa, ss_err_id = 0x2a1
[ 1541.198349] nvgpu: 17000000.gv11b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(0), offset(0)
[ 1541.198780] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x1, err_id=0xa, ss_err_id = 0x2a1
[ 1541.199139] nvgpu: 17000000.gv11b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(1), offset(2048)
[ 1541.199461] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x1, err_id=0xa, ss_err_id = 0x2a1
[ 1541.207373] nvgpu: 17000000.gv11b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(1), offset(2048)
[ 1541.220172] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x1, err_id=0xa, ss_err_id = 0x2a1
[ 1541.233706] nvgpu: 17000000.gv11b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(2), offset(4096)
[ 1541.246179] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x1, err_id=0xa, ss_err_id = 0x2a1
[ 1541.260016] nvgpu: 17000000.gv11b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(2), offset(4096)
[ 1541.272536] nvgpu: 17000000.gv11b gr_intr_handle_exception_interrupts:759  [ERR]  set gr exception notifier
[ 1541.282037] nvgpu: 17000000.gv11b     nvgpu_set_err_notifier_locked:149  [ERR]  error notifier set to 13 for ch 499
[ 1541.292819] __gv11b__ Channel Status - chip gv11b
[ 1541.292841] __gv11b__ ---------------------------
[ 1541.297197] __gv11b__ 494-gv11b, TSG: 9, pid 12735, refs: 2, deterministic: yes, domain name: (default)
[ 1541.301969] __gv11b__ channel status:  in use idle not busy
[ 1541.311406] __gv11b__ RAMFC: TOP: 000000000000 PUT: 000200960aac GET: 000200960aac FETCH: 020200960aac HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 000000000000 payload 0000000000000000 execute 00000000
[ 1541.317393] __gv11b__  
[ 1541.336077] __gv11b__ 495-gv11b, TSG: 9, pid 12735, refs: 2, deterministic: yes, domain name: (default)
[ 1541.338603] __gv11b__ channel status:  in use idle not busy
[ 1541.348123] __gv11b__ RAMFC: TOP: 000000000000 PUT: 000200861814 GET: 000200861814 FETCH: 020200861814 HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 000201177fb4 payload 000000000000001a execute 00001003
[ 1541.353414] __gv11b__  
[ 1541.372369] __gv11b__ 496-gv11b, TSG: 8, pid 12735, refs: 2, deterministic: yes, domain name: (default)
[ 1541.374879] __gv11b__ channel status:  in use pending busy
[ 1541.384240] __gv11b__ RAMFC: TOP: 000000000000 PUT: 000200771b18 GET: 000200771b18 FETCH: 020200771b18 HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 000201177f8c payload 000000000000002a execute 00000003
[ 1541.389447] __gv11b__  
[ 1541.408002] __gv11b__ 497-gv11b, TSG: 8, pid 12735, refs: 2, deterministic: yes, domain name: (default)
[ 1541.410320] __gv11b__ channel status:  in use pending busy
[ 1541.419818] __gv11b__ RAMFC: TOP: 000000000000 PUT: 000200667de4 GET: 000200667de4 FETCH: 020200667de4 HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 00200040d000 payload 000000000006014c execute 00000001
[ 1541.425396] __gv11b__  
[ 1541.443480] __gv11b__ 498-gv11b, TSG: 8, pid 12735, refs: 2, deterministic: yes, domain name: (default)
[ 1541.446304] __gv11b__ channel status:  in use pending busy
[ 1541.455516] __gv11b__ RAMFC: TOP: 000000000000 PUT: 000200606b34 GET: 000200606b34 FETCH: 020200606b34 HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 002000405000 payload 000000000006014a execute 00000001
[ 1541.461014] __gv11b__  
[ 1541.479312] __gv11b__ 499-gv11b, TSG: 8, pid 12735, refs: 4, deterministic: yes, domain name: (default)
[ 1541.482185] __gv11b__ channel status:  in use on_pbdma_and_eng busy
[ 1541.491388] __gv11b__ RAMFC: TOP: 000000000000 PUT: 0002004a2298 GET: 0002004a2298 FETCH: 0202004a2298 HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 000201177fb4 payload 000000000000002a execute 00000003
[ 1541.497933] __gv11b__  
[ 1541.516224] __gv11b__ 500-gv11b, TSG: 7, pid 9707, refs: 2, deterministic: no, domain name: (default)
[ 1541.518814] __gv11b__ channel status:  in use idle not busy
[ 1541.528309] __gv11b__ RAMFC: TOP: 8000002000417ad0 PUT: 002000417ad0 GET: 002000417ad0 FETCH: 002000417ad0 HEADER: 60400000 COUNT: 80000000 SEMAPHORE: addr 002000408000 payload 0000000000000000 execute 00100001
[ 1541.534086] __gv11b__  
[ 1541.552658] __gv11b__ 501-gv11b, TSG: 6, pid 9421, refs: 2, deterministic: no, domain name: (default)
[ 1541.555213] __gv11b__ channel status:  in use idle not busy
[ 1541.564466] __gv11b__ RAMFC: TOP: 8000002000429258 PUT: 002000429258 GET: 002000429258 FETCH: 002000429258 HEADER: 60400000 COUNT: 80000000 SEMAPHORE: addr 002000428000 payload 0000000000000000 execute 00000001
[ 1541.570212] __gv11b__  
[ 1541.588788] __gv11b__ 502-gv11b, TSG: 5, pid 9421, refs: 2, deterministic: no, domain name: (default)
[ 1541.591614] __gv11b__ channel status:  in use idle not busy
[ 1541.600790] __gv11b__ RAMFC: TOP: 800000200045a6a8 PUT: 00200045a6a8 GET: 00200045a6a8 FETCH: 00200045a6a8 HEADER: 60400000 COUNT: 80000000 SEMAPHORE: addr 002000420000 payload 0000000000000000 execute 00100001
[ 1541.606519] __gv11b__  
[ 1541.625172] __gv11b__ 503-gv11b, TSG: 4, pid 9107, refs: 2, deterministic: no, domain name: (default)
[ 1541.627779] __gv11b__ channel status:  in use idle not busy
[ 1541.636645] __gv11b__ RAMFC: TOP: 8000002000413e80 PUT: 002000413e80 GET: 002000413e80 FETCH: 002000413e80 HEADER: 60400000 COUNT: 80000000 SEMAPHORE: addr 002000408000 payload 0000000000000000 execute 00100001
[ 1541.642425] __gv11b__  
[ 1541.661319] __gv11b__ 504-gv11b, TSG: 3, pid 9087, refs: 2, deterministic: yes, domain name: (default)
[ 1541.663921] __gv11b__ channel status:  in use idle not busy
[ 1541.673070] __gv11b__ RAMFC: TOP: 000000000000 PUT: 000200960020 GET: 000200960020 FETCH: 020200960020 HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 000000000000 payload 0000000000000000 execute 00000000
[ 1541.678789] __gv11b__  
[ 1541.697243] __gv11b__ 505-gv11b, TSG: 3, pid 9087, refs: 2, deterministic: yes, domain name: (default)
[ 1541.699793] __gv11b__ channel status:  in use idle not busy
[ 1541.708905] __gv11b__ RAMFC: TOP: 000000000000 PUT: 000200860110 GET: 000200860110 FETCH: 020200860110 HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 000000000000 payload 0000000000000000 execute 00000000
[ 1541.714699] __gv11b__  
[ 1541.733056] __gv11b__ 506-gv11b, TSG: 2, pid 9087, refs: 2, deterministic: yes, domain name: (default)
[ 1541.735639] __gv11b__ channel status:  in use idle not busy
[ 1541.744888] __gv11b__ RAMFC: TOP: 000000000000 PUT: 000200760324 GET: 000200760324 FETCH: 020200760324 HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 002000411000 payload 0000000000000002 execute 00000001
[ 1541.750374] __gv11b__  
[ 1541.768667] __gv11b__ 507-gv11b, TSG: 2, pid 9087, refs: 2, deterministic: yes, domain name: (default)
[ 1541.771256] __gv11b__ channel status:  in use idle not busy
[ 1541.780493] __gv11b__ RAMFC: TOP: 000000000000 PUT: 000200660324 GET: 000200660324 FETCH: 020200660324 HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 00200040d000 payload 0000000000000002 execute 00000001
[ 1541.786243] __gv11b__  
[ 1541.804592] __gv11b__ 508-gv11b, TSG: 2, pid 9087, refs: 2, deterministic: yes, domain name: (default)
[ 1541.807403] __gv11b__ channel status:  in use idle not busy
[ 1541.816651] __gv11b__ RAMFC: TOP: 000000000000 PUT: 000200540324 GET: 000200540324 FETCH: 020200540324 HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 002000405000 payload 0000000000000002 execute 00000001
[ 1541.822383] __gv11b__  
[ 1541.840705] __gv11b__ 509-gv11b, TSG: 2, pid 9087, refs: 2, deterministic: yes, domain name: (default)
[ 1541.843266] __gv11b__ channel status:  in use idle not busy
[ 1541.852527] __gv11b__ RAMFC: TOP: 000000000000 PUT: 00020044f27c GET: 00020044f27c FETCH: 02020044f27c HEADER: 60400000 COUNT: 84000000 SEMAPHORE: addr 00020064ffb0 payload 0000000000000002 execute 00001003
[ 1541.858003] __gv11b__  
[ 1541.876577] __gv11b__ 510-gv11b, TSG: 1, pid 8549, refs: 2, deterministic: no, domain name: (default)
[ 1541.878881] __gv11b__ channel status:  in use idle not busy
[ 1541.888110] __gv11b__ RAMFC: TOP: 8000002000429208 PUT: 002000429208 GET: 002000429208 FETCH: 002000429208 HEADER: 60400000 COUNT: 80000000 SEMAPHORE: addr 002000428000 payload 0000000000000000 execute 00000001
[ 1541.894125] __gv11b__  
[ 1541.912690] __gv11b__ 511-gv11b, TSG: 0, pid 8549, refs: 2, deterministic: no, domain name: (default)
[ 1541.915018] __gv11b__ channel status:  in use idle not busy
[ 1541.924546] __gv11b__ RAMFC: TOP: 80000020004436d8 PUT: 0020004436d8 GET: 0020004436d8 FETCH: 0020004436d8 HEADER: 60400000 COUNT: 80000000 SEMAPHORE: addr 002000420000 payload 0000000000000000 execute 00100001
[ 1541.930270] __gv11b__  
[ 1541.948853] __gv11b__ PBDMA Status - chip gv11b
[ 1541.951161] __gv11b__ -------------------------
[ 1541.956042] __gv11b__ pbdma 0:
[ 1541.960550] __gv11b__   id: -1 - [channel] next_id: - -1 [channel] | status: invalid
[ 1541.963668] __gv11b__   PBDMA_PUT 0000002000413e80 PBDMA_GET 0000002000413e80
[ 1541.971597] __gv11b__   GP_PUT    0000061b  GP_GET  0000061b  FETCH   0000061b HEADER 60400000
[ 1541.978847] __gv11b__   HDR       00000000  SHADOW0 00413e58  SHADOW1 00002820
[ 1541.987079] __gv11b__ pbdma 1:
[ 1541.994532] __gv11b__   id: 8 - [tsg]     next_id: - -1 [channel] | status: valid
[ 1541.997483] __gv11b__   PBDMA_PUT 00000002004a2968 PBDMA_GET 00000002004a26a4
[ 1542.005089] __gv11b__   GP_PUT    000002c7  GP_GET  000002c1  FETCH   000002c7 HEADER 601101b4
[ 1542.012439] __gv11b__   HDR       6058206d  SHADOW0 004a25f4  SHADOW1 00037602
[ 1542.020575] __gv11b__ pbdma 2:
[ 1542.027585] __gv11b__   id: 9 - [tsg]     next_id: - -1 [channel] | status: valid
[ 1542.030893] __gv11b__   PBDMA_PUT 0000000200861994 PBDMA_GET 0000000200861994
[ 1542.038425] __gv11b__   GP_PUT    0000006c  GP_GET  0000006c  FETCH   0000006c HEADER 60400000
[ 1542.045607] __gv11b__   HDR       00000000  SHADOW0 00861958  SHADOW1 00003e02
[ 1542.054256] __gv11b__  
[ 1542.061523] __gv11b__ gv11b eng 0: 
[ 1542.064056] __gv11b__ id: 8 (tsg), next_id: -1 (channel), ctx status: valid 
[ 1542.067699] __gv11b__ busy 
[ 1542.074758] __gv11b__  
[ 1542.077364] __gv11b__ gv11b eng 1: 
[ 1542.079819] __gv11b__ id: 9 (tsg), next_id: -1 (channel), ctx status: valid 
[ 1542.083788] __gv11b__  
[ 1542.090841] __gv11b__ gv11b eng 2: 
[ 1542.093468] __gv11b__ id: -1 (channel), next_id: -1 (channel), ctx status: invalid 
[ 1542.096971] __gv11b__  
[ 1542.104742] __gv11b__ gv11b eng 3: 
[ 1542.107338] __gv11b__ id: -1 (channel), next_id: -1 (channel), ctx status: invalid 
[ 1542.111049] __gv11b__  
[ 1542.118768] __gv11b__  
[ 1542.126983] nvgpu: 17000000.gv11b           nvgpu_report_err_to_sdl:66   [ERR]  Failed to report an error: hw_unit_id = 0x9, err_id=0x8, ss_err_id = 0x289
[ 1542.137742] nvgpu: 17000000.gv11b gv11b_mm_mmu_fault_handle_buf_valid_entry:525  [ERR]  page fault error: err_type = 0x8, fault_status = 0x200
[ 1542.150117] nvgpu: 17000000.gv11b      gv11b_fb_mmu_fault_info_dump:294  [ERR]  [MMU FAULT] mmu engine id:  65, ch id:  499, fault addr: 0x212617000, fault addr aperture: 0, fault type: invalid pte, access type: virt write, 
[ 1542.169884] nvgpu: 17000000.gv11b      gv11b_fb_mmu_fault_info_dump:307  [ERR]  [MMU FAULT] protected mode: 0, client type: gpc, client id:  t1 1, gpc id if client type is gpc: 0, 
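The log above ends with an MMU fault reported for channel 499, and the channel dump earlier in the same log lists that channel under pid 12735. For longer logs, a small script can pull out this pairing automatically. The following is only a minimal sketch, assuming the nvgpu line formats are exactly as shown in this log; the function name `summarize_mmu_faults` and both regular expressions are hypothetical helpers written for illustration, not part of any NVIDIA tool.

```python
import re

# Assumed format of the nvgpu "[MMU FAULT]" line, taken from the log above.
MMU_FAULT_RE = re.compile(
    r"\[MMU FAULT\].*?ch id:\s*(?P<ch>\d+),\s*fault addr:\s*(?P<addr>0x[0-9a-fA-F]+),"
    r".*?fault type:\s*(?P<type>[^,]+),\s*access type:\s*(?P<access>[^,]+)"
)

# Assumed format of a channel header line, e.g. "499-gv11b, TSG: 8, pid 12735, ..."
CHANNEL_RE = re.compile(
    r"(?P<ch>\d+)-gv11b, TSG:\s*(?P<tsg>\d+), pid\s*(?P<pid>\d+)"
)


def summarize_mmu_faults(dmesg_text):
    """Return one dict per MMU fault, annotated with the owning TSG and pid
    when the channel dump in the same text lists that channel."""
    channels = {}
    faults = []
    for line in dmesg_text.splitlines():
        m = CHANNEL_RE.search(line)
        if m:
            channels[int(m["ch"])] = {"tsg": int(m["tsg"]), "pid": int(m["pid"])}
            continue
        m = MMU_FAULT_RE.search(line)
        if m:
            faults.append({
                "ch": int(m["ch"]),
                "addr": m["addr"],
                "type": m["type"].strip(),
                "access": m["access"].strip(),
            })
    # Attach owner info after the full scan, so the order of the
    # channel dump relative to the fault line does not matter.
    for fault in faults:
        fault.update(channels.get(fault["ch"], {}))
    return faults
```

Passing the captured dmesg text to `summarize_mmu_faults` returns a list of faults with the channel id, fault address, fault type, access type and, where available, the TSG and pid that own the channel. The pid can then be compared with the process that loads the engine.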

There has been no update from you for a while, so we assume this is no longer an issue.
We are therefore closing this topic. If you need further support, please open a new one.
Thanks

Hi,

Sorry for the late update.
Could you share the steps/source to reproduce this issue?

Thanks.