Argus errors with high CPU load and subsequent issues

This is a continuation of issues discussed in https://devtalk.nvidia.com/default/topic/1051530/jetson-tx2/argus-daemon-errors-max-frames-acquired/.

The problem is that Argus can hit an error (the errors seem to be ISP-related) and then become unrecoverable or segfault. After an error happens, these two messages are repeated at a very high rate:

SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 667)
(Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 109)

I’ve attached a program, stream_all, that reproduces the issue. It streams all cameras and performs a few calculations on the images. Every 20 seconds it prints the fps for each camera and some other info. Build and run instructions are at the top of stream_all.cpp.

https://drive.google.com/file/d/16MUBYSJV9X7L7MwKpC0HP4rGQognRbVG/view?usp=sharing

To reproduce, run the stream_all program along with stress --cpu X (stress can be installed via sudo apt-get install stress). In my case, there are 6 cameras running at 30fps with 1280x1080 resolution. With X >= 4 an error occurs pretty quickly. You might need to adjust X based on the number of cameras. I’m building against the single-process Argus library.

Currently, I have the acquireFrame() calls set to time out after 10 seconds. After an error occurs, the two messages shown above repeat until acquireFrame() times out for the problematic camera(s). The program then attempts to shut down the camera with stopRepeat() + waitForIdle(). Two things can happen:

  • It works. The repeating error messages also stop. However, trying to restart capture requests results in more errors, so the CaptureSession has to be killed and recreated.
  • It fails. This usually results in a segmentation fault.
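
A minimal sketch of the recovery flow described above, written in Python for brevity rather than against the C++ Argus API. Everything here is hypothetical: make_session, acquire_frame(), close() and CaptureTimeout are stand-ins for creating a CaptureSession, calling acquireFrame() with a timeout, and tearing the session down. It only illustrates the control flow of discarding the session after a timeout instead of restarting capture on it.

```python
class CaptureTimeout(Exception):
    """Raised by the hypothetical acquire_frame() when no frame arrives in time."""


def run_with_recovery(make_session, frames_wanted, timeout_s=10.0, max_restarts=3):
    """Acquire frames, recreating the session whenever an acquire times out.

    make_session is a hypothetical factory returning an object with
    acquire_frame(timeout_s) and close(). Returns (frames, restarts).
    """
    frames = []
    restarts = 0
    session = make_session()
    try:
        while len(frames) < frames_wanted:
            try:
                frames.append(session.acquire_frame(timeout_s))
            except CaptureTimeout:
                if restarts == max_restarts:
                    raise
                # Restarting capture on the same session led to more errors
                # in the report above, so throw the session away entirely.
                broken, session = session, None
                broken.close()
                session = make_session()
                restarts += 1
    finally:
        if session is not None:
            session.close()
    return frames, restarts
```

An in-process loop like this cannot help with the segfault case. Surviving that would presumably require running capture in a separate process that a supervisor restarts.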

Summarizing the issues:

  • ISP-related errors happen with no programmatic way to detect them, other than waiting for acquireFrame() to time out. The error frequency increases with CPU load.
  • Error messages repeat at a very high rate (> 5 kHz) after an error occurs. This makes finding the relevant messages difficult and fills up log files. It also uses a lot of compute. If no timeout is used with acquireFrame(), they can repeat seemingly forever.
  • The only way to recover is to let acquireFrame() time out, then kill and restart the CaptureSession. That only works if it doesn't segfault.
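
Until the message spam is fixed at the source, one way to find the relevant messages in a captured log is to collapse repeats after the fact. The sketch below is an assumption about how one might do that, not part of Argus: it keeps each distinct line once, in first-seen order, and counts how often it occurred. Because the two spam messages alternate, it tracks all lines seen so far rather than only comparing with the previous line.

```python
from collections import Counter


def collapse_repeats(lines):
    """Return (distinct lines in first-seen order, occurrence counts)."""
    counts = Counter()
    first_seen = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        if counts[line] == 0:  # Counter returns 0 for unseen keys
            first_seen.append(line)
        counts[line] += 1
    return first_seen, counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical log file name): python collapse.py < argus.log
    ordered, counts = collapse_repeats(sys.stdin)
    for line in ordered:
        print(f"{counts[line]:>8}x  {line}")
```

This only makes the log readable. It does nothing about the compute spent generating the messages or the disk space they take before filtering.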

These are the two types of errors I’ve seen:

SCF: Error NotSupported: AMR Sample data type is error, requested type is IspRawStats* (in src/components/amr/Sample.cpp, function typeError(), line 65)
SCF: Error NotSupported:  (in src/components/amr/Sample.cpp, function get(), line 101)
SCF: Error NotSupported:  (propagating from src/common/Amr.h, function getSampleObject(), line 488)
SCF: Error NotSupported:  (propagating from src/components/ac_stages/AeAfApplyStage.cpp, function translateIspOutStatsToFrd(), line 281)
SCF: Error NotSupported:  (propagating from src/components/ac_stages/AeAfApplyStage.cpp, function doHandleRequest(), line 618)
SCF: Error NotSupported:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 137)
SCF: Error NotSupported: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 990)
SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 667)
(Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 109)
Failed to fetch stats for frame 47963 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter:  (propagating from src/services/autocontrol/NvCameraIspDriver.cpp, function updateStats(), line 323)
SCF: Error BadParameter:  (propagating from src/components/stages/StatsUpdateStage.cpp, function doHandleRequest(), line 170)
SCF: Error BadParameter:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 158)
SCF: Error BadParameter: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 990)
SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 667)

hello kes25c,

Since your application streams all cameras and also performs some processing,
could you please narrow down the issue by streaming the cameras with the argus_camera application instead?

Several items for your reference:

A) FYI, we verified locally (l4t-r32.2/TX2) by streaming 6 cameras with the argus_camera application in multi-session mode while stressing all 6 CPU cores. It ran for 2 hours without crashing.

B) Hence, please run in the same environment to gather more information.
We suggest enabling tegrastats to monitor system usage, and also checking whether the failure is related to hardware throttling reducing the clock frequency.
See Thermal Management in BPMP in the developer guide, and the Hardware Throttling section for details.

C) An alternative is to change the system configuration.
Please raise the priority of both argus_camera and the nvargus_daemon service.
For example,
C.1) Adjust the priority of the camera processes.

# To modify the priority: niceness ranges from -20 (highest priority) to 19 (lowest priority), and the default is 0.
$ sudo renice -20 -p <pid>

C.2.1) Assign more CPUs to the camera application, for example two or even three.
C.2.2) Assign the other CPUs to the application you use for stressing.

# Restrict the process <pid> to CPU-2 and CPU-3.
$ taskset -c -p 2,3 <pid>

Looking forward to your test results,
thanks
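
The renice and taskset steps above can also be applied from a script. This is a minimal Linux-only sketch using Python's os module. The function name pin_and_prioritize is made up for illustration, and the pid, CPU set and niceness in the usage comment are placeholders to be replaced with real values.

```python
import os


def pin_and_prioritize(pid, cpus, niceness=None):
    """Restrict pid to the given CPUs and optionally change its niceness.

    pid 0 means the calling process. Returns (affinity, niceness) as read
    back from the kernel after the change.
    """
    # Roughly what `taskset -c -p <cpus> <pid>` does.
    os.sched_setaffinity(pid, cpus)
    if niceness is not None:
        # Roughly what `renice <niceness> -p <pid>` does.
        # Negative values normally require root.
        os.setpriority(os.PRIO_PROCESS, pid, niceness)
    return os.sched_getaffinity(pid), os.getpriority(os.PRIO_PROCESS, pid)


# Example (placeholder pid, run as root for a negative niceness):
#   pin_and_prioritize(1234, {2, 3}, niceness=-20)
```

CPU affinity is a per-thread attribute on Linux, so a call like this changes the task whose ID is given. Threads that the process has already started keep their own affinity unless each is changed as well.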

Hi Jerry,

A.) I encounter the same issue with argus_camera in multi-session mode. If I start it up, switch to multi-session, and then run stress --cpu 5 I hit the Failed to fetch stats for frame XXX error in under 5 minutes. This is with 32.2. Tested it 3 times, and the result was the same each time.

B.) The tegrastats output shows all six CPUs at 2035. How would I identify thermal throttling in the tegrastats output?

C.) Yes, I’ve seen that increasing the priority for the Argus threads helps reduce the frequency of occurrence. We are already doing this in our real application, but this is only hiding the issue… even with increased priorities it can still happen. If Argus truly needs cores dedicated exclusively to it, it would be good to know the reason why. Otherwise, how do we know how many cores are needed, or whether the issue will truly be solved this way?

hello kes25c,

  1. You’ll need to run tegrastats with sudo to get all the details.
    The thermal@ column shows the temperature in Celsius.
    For example,
... CPU [98%@2034,100%@2035,100%@2035,100%@2035,90%@2035,100%@2034] ... thermal@76.7C ...
    Please check whether this issue is related to thermal throttling. If so, you might review your thermal solution.

  2. To make sure the requested frame is present, I’ve attached two prebuilt libraries with larger internal queue buffers.
    Please check the attachment, devtalk1065378_Oct25_prebuilts.tar.gz.
    You’ll need to replace the existing libraries with these prebuilts and perform a warm reboot for the changes to take effect.
    thanks
    devtalk1065378_Oct25_prebuilts.tar.gz (2.82 MB)

This is the tegrastats output after the system had been idle for a while:

RAM 1702/7851MB (lfb 1272x4MB) CPU [0%@2034,0%@2034,1%@2034,0%@2034,0%@2035,1%@2034] EMC_FREQ 1%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% PLL@28.5C MCPU@28.5C PMIC@100C Tboard@22C GPU@26C BCPU@28.5C thermal@27.5C Tdiode@22.75C VDD_SYS_GPU 113/113 VDD_SYS_SOC 795/795 VDD_4V0_WIFI 0/0 VDD_IN 3621/3643 VDD_SYS_CPU 113/130 VDD_SYS_DDR 1184/1184

This is the output right after the error happened (two different cases):

RAM 2370/7851MB (lfb 1189x4MB) CPU [97%@2022,98%@2035,96%@2034,98%@2023,99%@2021,98%@2026] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% PLL@40C MCPU@40C PMIC@100C Tboard@30C GPU@35.5C BCPU@40C thermal@38.2C Tdiode@32.5C VDD_SYS_GPU 302/151 VDD_SYS_SOC 1625/1657 VDD_4V0_WIFI 0/10 VDD_IN 10362/10203 VDD_SYS_CPU 5215/5163 VDD_SYS_DDR 1566/1564

RAM 2385/7851MB (lfb 1184x4MB) CPU [97%@2034,96%@2035,96%@2034,98%@2035,98%@2036,98%@2035] EMC_FREQ 9%@1866 GR3D_FREQ 6%@1300 APE 150 MTS fg 0% bg 0% PLL@39.5C MCPU@39.5C PMIC@100C Tboard@29C GPU@34.5C BCPU@39.5C thermal@37.4C Tdiode@31.75C VDD_SYS_GPU 377/261 VDD_SYS_SOC 1700/1336 VDD_4V0_WIFI 0/3 VDD_IN 10580/7544 VDD_SYS_CPU 5177/2882 VDD_SYS_DDR 1659/1473

Temps don’t seem close to ranges that would cause throttling, and none of the frequencies have dropped.
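
The same check can be scripted over a longer tegrastats capture. Here is a minimal sketch that assumes the line format shown in the samples above. The function name and the abridged sample line are illustrative, and no throttling thresholds are assumed: it only extracts the per-core clocks and the named temperatures so they can be compared against whatever limits apply.

```python
import re


def parse_tegrastats(line):
    """Extract per-core (load %, MHz) pairs and named temperatures (deg C)."""
    cpu = re.search(r"CPU \[([^\]]*)\]", line)
    cores = []
    if cpu:
        cores = [(int(load), int(mhz))
                 for load, mhz in re.findall(r"(\d+)%@(\d+)", cpu.group(1))]
    # Matches fields such as thermal@38.2C; the load fields such as 8%@1866
    # are skipped because '%' is not a word character.
    temps = {name: float(value)
             for name, value in re.findall(r"(\w+)@(\d+(?:\.\d+)?)C\b", line)}
    return cores, temps


if __name__ == "__main__":
    # Abridged from the first post-error sample above.
    sample = ("CPU [97%@2022,98%@2035,96%@2034,98%@2023,99%@2021,98%@2026] "
              "EMC_FREQ 8%@1866 PMIC@100C GPU@35.5C thermal@38.2C")
    cores, temps = parse_tegrastats(sample)
    print("lowest core clock (MHz):", min(mhz for _, mhz in cores))
    print("thermal zone (C):", temps["thermal"])
```

PMIC reads 100C in all three samples above, including the idle one, so the sketch reports the thermal field rather than the maximum over all sensors.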

I’ll test the binaries.

Ran a 45-minute test with the new binaries and my stream_all app + stress --cpu 5. Cameras were in linear mode (vs HDR). I still hit a few errors, but no more message spamming. In fact, the program kept right on running after the errors, and acquireFrame() never timed out. Guessing that’s from your changes? Here were the error messages:

Failed to fetch stats for frame 163479 
Error 0x00000004 occurred at /media/jerry/Hitachi/L4T/T186/r32.x/camera/core_v3/camera_isp/isp/state_update/blocks/stats/LAC.cpp:664
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 163479, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
Failed to fetch stats for frame 203540 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 203540, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
Failed to fetch stats for frame 304862 
Error 0x00000004 occurred at /media/jerry/Hitachi/L4T/T186/r32.x/camera/core_v3/camera_isp/isp/state_update/blocks/stats/LAC.cpp:664
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 304862, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64

Same test, but with the cameras in HDR mode. Results were similar. The error occurred more often though. Log is below.

Failed to fetch stats for frame 37888 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 37888, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
Failed to fetch stats for frame 37887 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 37887, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64

Failed to fetch stats for frame 52760 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 52760, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
Failed to fetch stats for frame 52761 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 52761, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
Failed to fetch stats for frame 52759 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 52759, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64

Failed to fetch stats for frame 189987 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 189987, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)

Failed to fetch stats for frame 202636 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 202636, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
Failed to fetch stats for frame 202649 
Error 0x00000004 occurred at /media/jerry/Hitachi/L4T/T186/r32.x/camera/core_v3/camera_isp/isp/state_update/blocks/stats/LAC.cpp:664
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 202649, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64

Failed to fetch stats for frame 266271 
Error 0x00000004 occurred at /media/jerry/Hitachi/L4T/T186/r32.x/camera/core_v3/camera_isp/isp/state_update/blocks/stats/LAC.cpp:664
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266271, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
Failed to fetch stats for frame 266269 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266269, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
Failed to fetch stats for frame 266268 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266268, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
Failed to fetch stats for frame 266273 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266273, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
Failed to fetch stats for frame 266279 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266279, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
Failed to fetch stats for frame 266274 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
Failed to fetch stats for frame 266272 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266272, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
SCF: Error BadParameter: There was an error decoding stats for frame 266274, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
Failed to fetch stats for frame 266278 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266278, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
Failed to fetch stats for frame 266277 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266277, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
Failed to fetch stats for frame 266282 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266282, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
Failed to fetch stats for frame 266283 
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
Error 0x00000004 occurred at /media/jerry/Hitachi/L4T/T186/r32.x/camera/core_v3/camera_isp/isp/state_update/blocks/stats/LAC.cpp:664
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266283, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
Failed to fetch stats for frame 266287 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 266287, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero

Failed to fetch stats for frame 280792 
Error 0x00000004 occurred at /media/jerry/Hitachi/L4T/T186/r32.x/camera/core_v3/camera_isp/isp/state_update/blocks/stats/LAC.cpp:664
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 280792, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)

(NvCamV4l2) Error IoctlFailed:  (in /dvs/git/dirty/git-master_linux/camera/utils/nvcamv4l2/v4l2_device.cpp, function setControlValMultiple(), line 792)
(NvOdmDevice) Error IoctlFailed:  (propagating from dvs/git/dirty/git-master_linux/camera-partner/imager/src/devices/V4L2SensorViCsi.cpp, function setDeviceControls(), line 1856)
updateOutputSettings: Set Control failed. Use cached values

Failed to fetch stats for frame 347058 
SCF: Error BadParameter:  (propagating from src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 58)
SCF: Error BadParameter: There was an error decoding stats for frame 347058, frame stats will be invalid (in src/services/autocontrol/isp_stats/IspLegacyStatsDecoder.cpp, function decodeIspStats(), line 66)
totalCount of histogram is zero
(Autocontrol) Error BadParameter: histFromLac1 histogram totalMass is invalid (in src/algorithms/ae/ae_stats_utils.cpp, function NvIspAeCalcMaxRGBHistogramsFromLac1(), line 287)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeMidtoneMeter(), line 95)
(Autocontrol) Error BadParameter:  (propagating from src/algorithms/ae/plugins/aohdr/AEComputePluginAoHdr.cpp, function computeDerivedStats(), line 352)
AWB: M3Stats VWindows = 0 != 64
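
To compare how often the error occurs between runs (for example linear vs HDR), the frame numbers in logs like these can be grouped into bursts. This is a rough sketch under stated assumptions: the max_gap of 30 frames is an arbitrary choice (about one second at 30fps), and the log does not say which camera each frame number belongs to, so the grouping treats all cameras together.

```python
import re

FAILED = re.compile(r"Failed to fetch stats for frame (\d+)")


def failure_bursts(log_lines, max_gap=30):
    """Group failed frame numbers into bursts of nearby frames.

    Two failures belong to the same burst when their frame numbers differ
    by at most max_gap. Returns a list of sorted frame-number lists.
    """
    frames = sorted({int(m.group(1))
                     for line in log_lines
                     for m in [FAILED.search(line)] if m})
    bursts = []
    for frame in frames:
        if bursts and frame - bursts[-1][-1] <= max_gap:
            bursts[-1].append(frame)
        else:
            bursts.append([frame])
    return bursts
```

Calling len(failure_bursts(lines)) on each run's log gives a simple count of distinct error events that can be set against the run duration.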

hello kes25c,

It seems your multi-camera streaming works with the provided libraries, devtalk1065378_Oct25_prebuilts.tar.gz.
May I also know what your long-run criteria are?
BTW,
those SCF failure messages may be related to your environment setup.
What scene are your sensors capturing? Is it a low-light environment?
It is worth trying to limit the AE function to confirm; please enable ae-lock in the argus_camera application.
thanks

Yes, the streaming works with the devtalk1065378_Oct25_prebuilts.tar.gz binaries. Ideally we’d be able to have all six cameras stream indefinitely (assuming no thermal issues). In practice, we’d have them running for ~8-10 hours at a time. The longest test I’ve done so far is ~2 hours.

I’ve been testing in a typical office environment with fluorescent lighting and some sunlight coming in through a window. It’s not low-light. I will try enabling ae-lock tomorrow morning.

I’m also curious what was changed in the binaries you provided? You mentioned increased internal queue buffers, was that the only change? It still seemed to hit the ISP related error, but there was no message spam or timeout in acquireFrame() when the error happened. Why was that exactly? That is actually the behavior we would like - if the ISP encounters an error on a frame, simply drop that frame and keep going - versus having the CaptureSession completely stop working.

hello kes25c,

The camera architecture already supports the multi-camera use case.
Buffers are passed through several stages, from the low-level sensor driver layer to the internal camera core stack, and finally to rendering on the display.
This Argus failure is due to a race condition in passing buffers between those stages, which is only observed under high CPU load.

FYI, see CSI and USB Camera Features in the Jetson TX2 Series Software Features.
You’ll note that it does not claim to support running a multi-camera solution with the CPU stressed.

Since I’m not able to recreate the failure in comment #6 with the argus_camera application,
I suggest you also review your application implementation.
thanks

  • since I'm not able to recreate the failure in comment #6 with the argus_camera application, I suggest you also review your application implementation.

argus_camera produces the same errors for me. In fact, argus_camera produces them more frequently. Having AE-Lock enabled made no difference. Are you testing in multi-process or single-process mode? I built argus_camera in single-process.

  • there are several buffer transfers, from the low-level sensor driver layer to the internal camera core stack, and finally to rendering the frame on the display. this Argus failure is due to a race condition in the buffer transfer between stages, which is only observed with high CPU loading.

I’m still confused as to what changed in the binaries you provided. It’s still hitting the Failed to fetch stats for frame XXX error, it just isn’t spamming error messages afterward and is able to continue. Is that effect from increasing the internal queue sizes? I figured that increasing the queue sizes would only help in preventing the error… not change behavior after the error is encountered.

Also, it isn’t only observed with high CPU loading. High CPU load just triggers it more easily. If there’s a race condition that can happen with high cpu load, then it can happen with low cpu load. You’re just hoping that it’s infrequent, which isn’t a viable plan for something in production.

Interestingly, I noticed that when other errors happen (i.e. not the Failed to fetch stats for frame XXX) I’m still getting the high rate message spam even with the new binaries. For example, when running argus_camera if I switch the sensor mode while in multi-session mode I almost always hit this error:

SCF: Error Timeout:  (propagating from src/components/amr/Snapshot.cpp, function waitForNewerSample(), line 92)
SCF_AutocontrolACSync failed to wait for an earlier frame to complete.
SCF: Error Timeout:  (propagating from src/components/ac_stages/ACSynchronizeStage.cpp, function doHandleRequest(), line 126)
SCF: Error Timeout:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 137)
SCF: Error Timeout: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 990)

And then I get the high-rate repeated message spam:

SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 667)
(Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 109)

  • FYI, please refer to the CSI and USB Camera Features section of the Jetson TX2 Series Software Features. you'll note that it does not claim support for running a multi-cam solution with the CPU stressed.

What is the definition of CPU stressed to know when it should work or not?

hello kes25c,

>>> What is the definition of CPU stressed to know when it should work or not?

executing a test tool to stress the CPU (i.e. $ stress --cpu 5) is NOT a realistic use-case.
please share what your real use-case is.

>>> Are you testing in multi-process or single-process mode?

we’re running Argus Version: 0.97.3 in multi-process mode.

in addition,
we can share pre-built libraries that tune the buffers in the queue. the major purpose is to avoid the crash, and also to make the error decoding stats non-fatal, so that it just returns a bad status.
please try another set of pre-built libraries, devtalk1065378_Oct29_prebuilts.tar.gz.
these also increase the timeout values and configure the internal queue buffers.

again, a stress test tool that maximizes CPU usage will induce frame drops, that’s life.
you’ll need to evaluate the real workload in order to optimize the system, adjust priorities, etc.
I suggest you also apply the system configuration; please refer to comment #2 for the steps.

to evaluate the workload, you may use the default Ubuntu tools (i.e. $ top) to check your system loading.
or, you could use the nsight-systems tools to profile CPU usage.
thanks

devtalk1065378_Oct29_prebuilts.tar.gz (2.82 MB)

>>> executing a test tool to stress the CPU (i.e. $ stress --cpu 5) is NOT a realistic use-case.
please share what your real use-case is.

This was mentioned in the other thread, but you are focusing too much on the use of the stress program. It is only an accelerant. It doesn’t have to be stress --cpu 5. It happens with stress --cpu 3 or 2 or 1… just not as often. I’ve hit them with only the argus_camera app running in multi-session mode and nothing else, and we have hit them under normal load. For example, just doing some image pre-processing and depth on the gpu with minimal cpu usage besides Argus. That’s why I came to the forum. The stress program is simply a tool that helps to trigger the issue more readily.

>>> again, a stress test tool that maximizes CPU usage will induce frame drops, that’s life.
you’ll need to evaluate the real workload in order to optimize the system, adjust priorities, etc.
I suggest you also apply the system configuration; please refer to comment #2 for the steps.

Frame drop is perfectly acceptable. I expect that to happen as the system load increases. The issue is not dropping frames, the issue is errors happening under the covers with no clean way to recover as mentioned by ben.lemond in the original thread. The current behavior results in a huge amount of log spam and the CaptureSession gets into a non-recoverable state. Trying to get the cameras going again after this happens only occasionally works and sometimes results in a segfault. Those are the issues. Currently, it seems that all of these errors are treated as fatal even though it’s possible to drop the frame and continue (at least for some of the errors). Argus should provide a better way for client apps to deal with these errors that can happen under the covers.

>>> we can share pre-built libraries that tune the buffers in the queue. the major purpose is to avoid the crash, and also to make the error decoding stats non-fatal, so that it just returns a bad status.

OK, that’s what I was wondering. So you made the decoding stats error non-fatal, which is why it can continue after it happens. That behavior is much better for us, but it only helps with that one error unfortunately.

My intention is not to be combative. It’s to get a reliable system in place. You say that CPU stressed isn’t supported, but don’t define what that means. If necessary, we can restrict our code from running on two cores in order to reserve them for Argus, but I have not seen evidence that this will eliminate the issue. It would only make it less likely to happen, in the same way that lower cpu load does. Since it happens with only argus_camera running, it seems unlikely to me that reserving cores for Argus will fix things. Maybe you could explain why these issues become impossible if we reserve two cores or more for Argus?

hello, kes25c

>>> I’ve hit them with only the argus_camera app running in multi-session mode and nothing else, and we have hit them under normal load. For example, just doing some image pre-processing and depth on the gpu with minimal cpu usage besides Argus. That’s why I came to the forum. The stress program is simply a tool that helps to trigger the issue more readily.

okay, I might have jumped to a conclusion too fast.
according to CSI and USB Camera Features, preview performance is 30 frames/second at 1920×1440 resolution with six OV5693 sensors running simultaneously.

FYI,
we’ve also verified that there’s no such failure either with or without the CPU stress tool.
let’s check whether this issue is related to the sensor drivers.

  1. may I know which sensor you’re working with, and what the output resolution and frame-rate settings are?
  2. besides running in argus multi-session mode, could you please try launching six argus instances to check whether you can reproduce the same failures?
  3. you might also confirm by setting up 6-cam streaming with v4l2 standard controls,
    for example,
$ v4l2-ctl -d /dev/video0 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl bypass_mode=0 --stream-mmap
$ v4l2-ctl -d /dev/video1 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl bypass_mode=0 --stream-mmap
...
$ v4l2-ctl -d /dev/video5 --set-fmt-video=width=1920,height=1080,pixelformat=RG10 --set-ctrl bypass_mode=0 --stream-mmap

Hi Jerry,

I was working on a streaming application using argus with a 3-camera setup. I ran into the same issue:

SCF: Error Timeout:  (propagating from src/components/amr/Snapshot.cpp, function waitForNewerSample(), line 92)
SCF_AutocontrolACSync failed to wait for an earlier frame to complete.
SCF: Error Timeout:  (propagating from src/components/ac_stages/ACSynchronizeStage.cpp, function doHandleRequest(), line 126)
SCF: Error Timeout:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 137)
SCF: Error Timeout: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 990)

And then I get the high-rate repeated message spam:

SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 667)
(Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 109)

I am running R28.2.1 on my TX2 board with 3 IMX185 sensors. The drivers were provided by Leopard Imaging. Currently, Leopard Imaging does not have driver support for R32.2, which seems to have fixed some of these issues. Would R32.1 have the same fixes? Or, would I need to just wait for LI to come out with support for 32.2?

hello vision2,

to clarify, we did have bugs in R32.1 for the multi-camera use-case.
therefore, the next l4t release, R32.2, includes the fix to address some race condition failures.

since we suggest basing multiple camera use-case implementations on R32.2,
I suggest you ask your sensor vendor for R32.2 driver support.
thanks

Hi JerryChang

Let me clarify the L4T version you mentioned.

In https://developer.nvidia.com/embedded/downloads,
L4T 32.2.3 was already released on 2019/11/19.

  • What does ‘R32.2’ mean?
  • Is the fix you mentioned the same as the patch attached in comment #4?

hello rary,

there are JetPack releases and also L4T source packages.
however, this is a pre-built library update, which is only released with the JetPack images.

therefore, please check the JetPack Archive for details.
thanks

Hi JerryChang,

Sorry for the late response. I was focused on other tasks. To answer your questions:

>>> 1) may I know which sensor you’re working with, and what the output resolution and frame-rate settings are?

We’re using the OV10640. 1280x1080 resolution @ 30fps. Output of v4l2-ctl -d /dev/video0 --list-formats-ext:

ioctl: VIDIOC_ENUM_FMT
	Index       : 0
	Type        : Video Capture
	Pixel Format: 'BA12'
	Name        : 12-bit Bayer GRGR/BGBG
		Size: Discrete 1280x1080
			Interval: Discrete 0.033s (30.000 fps)
		Size: Discrete 1280x1080
			Interval: Discrete 0.033s (30.000 fps)

	Index       : 1
	Type        : Video Capture
	Pixel Format: 'BG12'
	Name        : 12-bit Bayer BGBG/GRGR
		Size: Discrete 1280x1080
			Interval: Discrete 0.033s (30.000 fps)
		Size: Discrete 1280x1080
			Interval: Discrete 0.033s (30.000 fps)

>>> 2) besides running in argus multi-session mode, could you please try launching six argus instances to check whether you can reproduce the same failures?

I should be able to test this tomorrow, and will let you know the result.

>>> 3) you might also confirm by setting up 6-cam streaming with v4l2 standard controls,

I tried 6 camera streaming with v4l2 and stress --cpu 5. It ran for 30 minutes (twice) without problems, maintaining 30fps. The commands used were:

v4l2-ctl -d /dev/videoX --set-fmt-video=width=1280,height=1080,pixelformat=BG12 --set-ctrl bypass_mode=0 --stream-mmap --stream-count 54000
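
For reference, --stream-count 54000 is 30 fps × 60 s × 30 min, i.e. one 30-minute run per camera. A minimal sketch of starting all six streams in parallel is below. It assumes the cameras enumerate as /dev/video0 through /dev/video5; NUM_CAMS and RUN are names made up for this sketch, not v4l2-ctl features. By default it only prints the commands; setting RUN to an empty string makes it execute them.

```shell
#!/bin/sh
# Sketch: one v4l2-ctl capture per camera, all started in parallel.
# Assumption: the cameras are /dev/video0 .. /dev/video(NUM_CAMS-1).
NUM_CAMS=${NUM_CAMS:-6}
FPS=30
MINUTES=30
FRAMES=$((FPS * 60 * MINUTES))   # 30 * 60 * 30 = 54000 frames

# Print (default) or run the capture command for camera index $1.
# RUN is a hypothetical switch: unset -> echo only, RUN="" -> execute.
stream_cmd() {
    ${RUN-echo} v4l2-ctl -d "/dev/video$1" \
        --set-fmt-video=width=1280,height=1080,pixelformat=BG12 \
        --set-ctrl bypass_mode=0 --stream-mmap --stream-count "$FRAMES"
}

i=0
while [ "$i" -lt "$NUM_CAMS" ]; do
    stream_cmd "$i" &      # background each stream so they overlap
    i=$((i + 1))
done
wait                       # block until every stream has finished
```

Running stress --cpu 5 in a second terminal while this is active gives the same kind of load as the test described above.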

>>> we’ve also verified that there’s no such failure either with or without the CPU stress tool.

Was that using argus_camera in multi-session mode with six cameras and stress --cpu 5? Or some other setup? And for how long did it run?

With 32.2, as long as we set a high (-20) priority for Argus-related threads, this issue happens pretty rarely under our normal system load.
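
One possible way to apply such a priority from the shell is sketched below. It assumes the -20 is a nice value and that the Argus threads live in a process whose PID you already know (in single-process builds that is the application itself). On Linux the nice value is a per-thread attribute, so the sketch walks the thread IDs under /proc/<pid>/task. renice_threads, PROC_ROOT and RUN are hypothetical names for this sketch only, and this is not necessarily how the priority was set in the test above.

```shell
#!/bin/sh
# Sketch: apply a nice value to every thread of one process.
# On Linux the nice value is per-thread, so each thread ID listed
# under /proc/<pid>/task is reniced individually.
# PROC_ROOT and RUN are hypothetical knobs for this sketch:
#   PROC_ROOT defaults to /proc; RUN unset -> print only, RUN="" -> execute.
renice_threads() {
    pid=$1
    nice_val=${2:--20}                      # default nice value: -20
    for task in "${PROC_ROOT:-/proc}/$pid"/task/*; do
        [ -d "$task" ] || continue          # no such process or thread
        ${RUN-echo} renice -n "$nice_val" -p "${task##*/}"
    done
}
```

Calling renice_threads <pid> prints the renice commands it would run. To apply them, set RUN to an empty string and run as root, since negative nice values normally need elevated privileges. Threads created after the call are not covered by this loop.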

I tested running six separate instances of argus_camera in capture mode, and am able to trigger the issue.

I also observed a separate issue related to multi-session mode, but it sort of fits here since it results in the high-rate repeated error spam. When running argus_camera (single process) in multi-session mode, if I switch the sensor mode it almost always triggers the following error (which is then followed by the high-rate repeating error messages):

SCF: Error Timeout:  (propagating from src/components/amr/Snapshot.cpp, function waitForNewerSample(), line 92)
SCF_AutocontrolACSync failed to wait for an earlier frame to complete.
SCF: Error Timeout:  (propagating from src/components/ac_stages/ACSynchronizeStage.cpp, function doHandleRequest(), line 126)
SCF: Error Timeout:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 137)

In our case, there are two sensor modes. 0 (HDR) and 1 (linear). The steps are:

  1. start argus_camera
  2. switch to multisession mode
  3. wait a little while (for all the cameras to show up and stream)
  4. switch the sensor mode

Interestingly, the behavior appears different between single and multi process builds of argus_camera. If I repeat these steps with the multi-process build, I don’t see any errors show up in /var/log/syslog. The argus_camera app just hangs and becomes unresponsive.