Argus image acquisition crashes after a few days

Hi all,

I have a system that acquires images from 2 cameras at ~30 fps. The images are acquired using libargus. After a few days of running my application continuously (without stressing the system), I get the following error:

SCF: Error InvalidState:  Corr Error Received for sensor 2 .. Continuing!
Mär 09 15:15:26   (in src/services/capture/FusaCaptureViCsiHw.cpp, function waitCsiFrameEnd(), line 643)
Mär 09 15:15:26  SCF: Error ResourceAlreadyInUse:  (propagating from src/services/capture/FusaCaptureViCsiHw.cpp, function startCaptureInternal(), line 866)
Mär 09 15:15:26  SCF: Error ResourceAlreadyInUse:  (propagating from src/services/capture/CaptureRecord.cpp, function doCSItoMemCapture(), line 536)
Mär 09 15:15:26  SCF: Error ResourceAlreadyInUse:  (propagating from src/services/capture/CaptureRecord.cpp, function issueCapture(), line 483)
Mär 09 15:15:26  SCF: Error ResourceAlreadyInUse:  (propagating from src/services/capture/CaptureServiceDevice.cpp, function issueCaptures(), line 1530)
Mär 09 15:15:26  SCF: Error ResourceAlreadyInUse:  (propagating from src/services/capture/CaptureServiceDevice.cpp, function issueCaptures(), line 1359)
Mär 09 15:15:26  SCF: Error ResourceAlreadyInUse:  (propagating from src/common/Utils.cpp, function workerThread(), line 114)
Mär 09 15:15:26  SCF: Error ResourceAlreadyInUse: Worker thread CaptureScheduler frameStart failed (in src/common/Utils.cpp, function workerThread(), line 133)
Mär 09 15:15:26  SCF: Error Timeout:  (propagating from src/api/Buffer.cpp, function waitForUnlock(), line 644)
Mär 09 15:15:26  SCF: Error Timeout:  (propagating from src/components/CaptureContainerImpl.cpp, function returnBuffer(), line 426)
Mär 09 15:15:26  SCF: Error InvalidState: Capture Scheduler not running (in src/services/capture/CaptureServiceDevice.cpp, function addNewItemToSchedule(), line 1004)
Mär 09 15:15:26  SCF: Error InvalidState:  (propagating from src/services/capture/CaptureService.cpp, function addRequest(), line 411)
Mär 09 15:15:26  SCF: Error InvalidState:  (propagating from src/components/stages/MemoryToISPCaptureStage.cpp, function doHandleRequest(), line 144)
Mär 09 15:15:26  SCF: Error InvalidState:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 158)
Mär 09 15:15:26  SCF: Error InvalidState: Sending critical error event for Session 0
Mär 09 15:15:26   (in src/api/Session.cpp, function sendErrorEvent(), line 1039)
Mär 09 15:15:26  SCF: Error InvalidState: Capture Scheduler not running (in src/services/capture/CaptureServiceDevice.cpp, function addNewItemToSchedule(), line 1004)
Mär 09 15:15:26  SCF: Error InvalidState:  (propagating from src/services/capture/CaptureService.cpp, function addRequest(), line 411)
Mär 09 15:15:26  SCF: Error InvalidState:  (propagating from src/components/stages/SensorCaptureStage.cpp, function doHandleRequest(), line 87)
Mär 09 15:15:26  SCF: Error InvalidState:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 158)

My application crashes because the iCaptureSession->isRepeating() check fails, and I have no way to recover from that.
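One possible way to make such a failure recoverable at the application level is a supervisor loop that tears the capture pipeline down and rebuilds it when the health check fails. The following is only a minimal sketch under stated assumptions: `CapturePipeline`, `SuperviseResult`, and `supervise` are hypothetical names, not part of libargus, and the three callbacks are placeholders for the application's own session setup, its isRepeating() check, and its teardown. Whether a rebuilt session actually works after this particular error is not confirmed anywhere in this thread.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical wrapper around the application's capture pipeline.
// None of these names come from libargus; each callback is where the
// real session setup, health check, and teardown code would go.
struct CapturePipeline {
    std::function<bool()> start;    // create the session and start the repeating request
    std::function<bool()> healthy;  // e.g. wraps the isRepeating() check
    std::function<void()> stop;     // tear the session down completely
};

// Outcome of one supervision run.
struct SuperviseResult {
    int restarts = 0;     // how many times the pipeline was rebuilt
    bool gaveUp = false;  // true if the restart budget was exhausted
};

// Polls the pipeline and rebuilds it whenever it is not running or not healthy.
// Returns when keepRunning() becomes false or when maxRestarts rebuilds
// have been used and the pipeline is down again.
SuperviseResult supervise(CapturePipeline& pipeline,
                          const std::function<bool()>& keepRunning,
                          int maxRestarts,
                          std::chrono::milliseconds pollInterval,
                          std::chrono::milliseconds restartDelay)
{
    assert(maxRestarts >= 0);
    SuperviseResult result;
    bool running = pipeline.start();

    while (keepRunning()) {
        if (running && pipeline.healthy()) {
            std::this_thread::sleep_for(pollInterval);
            continue;
        }
        // The pipeline is down: stop if the restart budget is spent.
        if (result.restarts >= maxRestarts) {
            result.gaveUp = true;
            break;
        }
        pipeline.stop();
        std::this_thread::sleep_for(restartDelay);  // give the driver time to settle
        ++result.restarts;
        running = pipeline.start();
    }

    pipeline.stop();
    return result;
}
```

The restart budget keeps a permanently broken camera from causing an endless rebuild loop; when `gaveUp` is set, the application can escalate, for example by exiting so that a service manager restarts the whole process.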

Is this a known stability issue?

I am using JetPack 5.1.2, and the Argus version is 0.99.3.3 (multi-process).

I hope someone can help.

Edit:

I just checked the dmesg log. I'm not 100% sure whether this is related to the error above, but it looks like the GPU had some kind of DMA error:

[ 7620.240160] __ga10b__
[ 7620.258802] __ga10b__ PBDMA Status - chip ga10b
[ 7620.261420] __ga10b__ -------------------------
[ 7620.265977] __ga10b__ pbdma 0:
[ 7620.270525] __ga10b__   id: -1 - [channel] next_id: - -1 [channel] | status: invalid
[ 7620.273678] __ga10b__   PBDMA_PUT 0000002004300050 PBDMA_GET 0000002004300050
[ 7620.281552] __ga10b__   GP_PUT    00000008  GP_GET  00000008  FETCH   00000008 HEADER 2140006c
[ 7620.288815] __ga10b__   HDR       2001001b  SHADOW0 04300028  SHADOW1 00002820
[ 7620.297300] __ga10b__ pbdma 1:
[ 7620.304474] __ga10b__   id: 0 - [tsg]     next_id: - -1 [channel] | status: valid
[ 7620.307454] __ga10b__   PBDMA_PUT 0000000200860308 PBDMA_GET 00000002008602f0
[ 7620.315067] __ga10b__   GP_PUT    00000002  GP_GET  00000002  FETCH   00000002 HEADER 20111b00
[ 7620.322415] __ga10b__   HDR       200426c0  SHADOW0 00860294  SHADOW1 00007602
[ 7620.330814] __ga10b__ pbdma 2:
[ 7620.337814] __ga10b__   id: 1 - [tsg]     next_id: - -1 [channel] | status: valid
[ 7620.340877] __ga10b__   PBDMA_PUT 0000000200c60d04 PBDMA_GET 0000000200c60d04
[ 7620.348404] __ga10b__   GP_PUT    00000038  GP_GET  00000038  FETCH   00000038 HEADER 21540300
[ 7620.355579] __ga10b__   HDR       200180c0  SHADOW0 00c60cc8  SHADOW1 00003e02
[ 7620.364238] __ga10b__ pbdma 3:
[ 7620.371499] __ga10b__   id: -1 - [channel] next_id: - -1 [channel] | status: invalid
[ 7620.374553] __ga10b__   PBDMA_PUT 0000008410c05818 PBDMA_GET 000000a070a06f44
[ 7620.382353] __ga10b__   GP_PUT    00000000  GP_GET  fe15a9a7  FETCH   00000000 HEADER a1c63f50
[ 7620.389526] __ga10b__   HDR       1f0910e2  SHADOW0 61a88841  SHADOW1 40027697
[ 7620.398188] __ga10b__ pbdma 4:
[ 7620.405451] __ga10b__   id: -1 - [channel] next_id: - -1 [channel] | status: invalid
[ 7620.408515] __ga10b__   PBDMA_PUT 0000005938981000 PBDMA_GET 0000007f620df3a0
[ 7620.416303] __ga10b__   GP_PUT    00000000  GP_GET  39d98750  FETCH   00000000 HEADER a102ae94
[ 7620.423477] __ga10b__   HDR       000153c2  SHADOW0 a8395844  SHADOW1 949ce29f
[ 7620.432138] __ga10b__ pbdma 5:
[ 7620.439401] __ga10b__   id: 2 - [tsg]     next_id: - -1 [channel] | status: valid
[ 7620.442454] __ga10b__   PBDMA_PUT 0000000201260020 PBDMA_GET 0000000201260020
[ 7620.449990] __ga10b__   GP_PUT    00000001  GP_GET  00000001  FETCH   00000001 HEADER 21540300
[ 7620.457164] __ga10b__   HDR       200180c0  SHADOW0 01060188  SHADOW1 00003e02
[ 7620.465823] __ga10b__
[ 7620.473092] __ga10b__ ga10b eng 0:
[ 7620.475627] __ga10b__ id: 0 (tsg), next_id: -1 (channel), ctx status: valid
[ 7620.479298] __ga10b__ busy
[ 7620.486300] __ga10b__
[ 7620.489187] __ga10b__ ga10b eng 1:
[ 7620.491637] __ga10b__ id: -1 (channel), next_id: -1 (channel), ctx status: invalid
[ 7620.495310] __ga10b__
[ 7620.503189] __ga10b__ ga10b eng 2:
[ 7620.505712] __ga10b__ id: -1 (channel), next_id: -1 (channel), ctx status: invalid
[ 7620.509310] __ga10b__
[ 7620.517100] __ga10b__ ga10b eng 3:
[ 7620.519726] __ga10b__ id: 1 (tsg), next_id: -1 (channel), ctx status: valid
[ 7620.523223] __ga10b__
[ 7620.530313] __ga10b__ ga10b eng 4:
[ 7620.532938] __ga10b__ id: -1 (channel), next_id: -1 (channel), ctx status: invalid
[ 7620.536610] __ga10b__
[ 7620.544225] __ga10b__ ga10b eng 5:
[ 7620.546673] __ga10b__ id: -1 (channel), next_id: -1 (channel), ctx status: invalid
[ 7620.550337] __ga10b__
[ 7620.557960] __ga10b__
[ 7620.576041] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:55   [ERR]  Error reporting is not supported in this platform
[ 7620.576431] nvgpu: 17000000.ga10b gv11b_mm_mmu_fault_handle_buf_valid_entry:525  [ERR]  page fault error: err_type = 0x8, fault_status = 0x200
[ 7620.587377] nvgpu: 17000000.ga10b      gv11b_fb_mmu_fault_info_dump:294  [ERR]  [MMU FAULT] mmu engine id:  65, ch id:  492, fault addr: 0x205a1c000, fault addr aperture: 0, fault type: invalid pde, access type: virt write,
[ 7620.607060] nvgpu: 17000000.ga10b      gv11b_fb_mmu_fault_info_dump:307  [ERR]  [MMU FAULT] protected mode: 0, client type: gpc, client id:  t1_0, gpc id if client type is gpc: 0,
[ 7646.933258] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:55   [ERR]  Error reporting is not supported in this platform
[ 7646.933626] nvgpu: 17000000.ga10b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(0), offset(0)
[ 7646.934034] nvgpu: 17000000.ga10b nvgpu_gr_intr_handle_sm_exception:390  [ERR]  could not pre-process sm error!
[ 7646.934343] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:55   [ERR]  Error reporting is not supported in this platform
[ 7646.934714] nvgpu: 17000000.ga10b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(0), offset(0)
[ 7646.935098] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:55   [ERR]  Error reporting is not supported in this platform
[ 7646.935432] nvgpu: 17000000.ga10b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(1), offset(2048)
[ 7646.936986] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:55   [ERR]  Error reporting is not supported in this platform
[ 7646.938785] nvgpu: 17000000.ga10b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(1), offset(2048)
[ 7646.949388] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:55   [ERR]  Error reporting is not supported in this platform
[ 7646.961665] nvgpu: 17000000.ga10b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(2), offset(4096)
[ 7646.974045] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:55   [ERR]  Error reporting is not supported in this platform
[ 7646.986347] nvgpu: 17000000.ga10b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(2), offset(4096)
[ 7646.998736] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:55   [ERR]  Error reporting is not supported in this platform
[ 7647.011019] nvgpu: 17000000.ga10b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(3), offset(6144)
[ 7647.023393] nvgpu: 17000000.ga10b nvgpu_cic_mon_report_err_safety_services:55   [ERR]  Error reporting is not supported in this platform
[ 7647.035693] nvgpu: 17000000.ga10b nvgpu_gr_intr_handle_sm_exception:365  [ERR]  sm machine check err. gpc_id(0), tpc_id(3), offset(6144)
[ 7647.048094] nvgpu: 17000000.ga10b gr_intr_handle_exception_interrupts:759  [ERR]  set gr exception notifier
[ 7647.058023] nvgpu: 17000000.ga10b     nvgpu_set_err_notifier_locked:149  [ERR]  error notifier set to 13 for ch 492
[ 7647.068615] __ga10b__ Channel Status - chip ga10b

Please try replacing the libraries with the ones attached below.

libnvfusacap_35.4.1.so.txt (193.6 KB)
libnvargus.so.0201.txt (1.2 MB)
libnvscf.so.0201.txt (8.4 MB)

I started the test with the replaced libraries. It might take several hours until the error shows up.

Can you explain what you changed in those libraries?

Some fixes for stability.

Thanks

The test program has now been running for ~2.5 days, which is longer than it ran before. I'm not calling it fixed yet, but it seems more stable than before. Thank you.

Will this fix be available in higher JetPack versions?

I suppose so.

We have run extensive tests over several days and can't reproduce the error anymore. I consider this fixed.

@ShaneCCC do you have the updated libraries for JetPack 5.1.1?

Sorry, no.
Please update to r35.4.1.

Thanks
