We’re hitting a reproducible nvargus-daemon deadlock on JP 6.2.2 (L4T r36.5.0, R36 REVISION 5.0) with 7–8 concurrent ZED-X camera streams via nvarguscamerasrc on AGX Orin. The daemon
wedges without crashing: accept() still runs on the control socket and systemd sees a healthy unit, but all capture sessions return UNAVAILABLE. Recovery requires a manual
systemctl restart nvargus-daemon.
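Until there is a fix, we paper over this with a journal watchdog that restarts the daemon when the wedge signature appears. A minimal sketch of the idea, not our production script: the trigger string comes from the journalctl error quoted further down, while the 30-second journal window and 10-second poll interval are our choices.

```python
#!/usr/bin/env python3
"""Watchdog sketch: restart nvargus-daemon when the fusa InvalidState
error shows up in its journal. The trigger string is taken from our
journalctl capture; the poll interval and journal window are assumptions."""
import subprocess
import time

TRIGGER = "fusaViHandler.cpp 869"   # first line of the observed error pair

def wedged(journal_text: str) -> bool:
    """Return True if the journal excerpt contains the wedge signature."""
    return TRIGGER in journal_text

def watch_loop():
    """Poll the journal every 10 s; restart the daemon on a match.
    Not called automatically -- run it from a systemd timer or tmux."""
    while True:
        out = subprocess.run(
            ["journalctl", "-u", "nvargus-daemon", "--since", "-30s", "-o", "cat"],
            capture_output=True, text=True).stdout
        if wedged(out):
            subprocess.run(["systemctl", "restart", "nvargus-daemon"])
        time.sleep(10)
```

Obviously a restart drops all active capture sessions, so this is a mitigation, not a fix.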
This is a follow-up to thread 361085, where @JerryChang confirmed the fixes ship in JP 6.2.2. We upgraded, verified the binaries (sha256 match against the stock BSP), and the bug persists.
Environment (clean flash, no modifications to NVIDIA libraries):
- JetPack 6.2.2 / L4T r36.5.0 (GCID 43688277)
- Kernel: 5.15.185-rt-tegra (PREEMPT_RT)
- AGX Orin, 7x ZED-X one s = 7 active streams
- 1x Stereloabs quad capture card
- ZED Link driver v1.4.1, ZED SDK v5.2.3
- enableCamInfiniteTimeout=1 set in nvargus-daemon.service
- libnvscf.so sha256: 944bf7830e342181c0e92361375753762889b643dbcaba6672d63cd002bfcd20
Trigger (from journalctl -u nvargus-daemon):
Module_id 30 Severity 2 : (fusa) Error: InvalidState Status syncpoint signaled but status value not updated in:/capture/src/fusaViHandler.cpp 869
Module_id 30 Severity 2 : (fusa) Error: InvalidState propagating from:/capture/src/fusaViHandler.cpp 811
Only these two lines, then silence: no crash, no recovery, and the daemon stops processing work.
gdb backtrace of the wedged daemon (79 threads, attached to the PID via gdb -batch -ex "thread apply all bt full"):
- Thread 1 (main): accept() on Argus socket — daemon appears alive
- 12x SCF Execution threads: all in pthread_cond_wait — worker pool idle, no work dispatched
- 2x CaptureScheduler threads: pthread_cond_wait via NvOsSemaphoreWaitTimeout in libnvscf.so at address 0xffff9e28cc04
- 2x V4L2CaptureScheduler threads: same pthread_cond_wait pattern in libnvscf.so
- 6 healthy sensor groups (~42 threads): blocked in ioctl() on V4L2 DQBUF — normal
- 1 wedged sensor group (~7 threads): pthread_cond_wait in libnvscf.so — the failed channel (src5)
- 4 client session handlers: pthread_cond_wait — blocked waiting for events
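The per-state counts above came from a small script over the gdb dump rather than manual counting. A sketch of the approach (the marker strings and the gdb thread-header regex are assumptions based on what our backtrace looks like):

```python
#!/usr/bin/env python3
"""Summarize blocking sites in a `gdb -batch -ex "thread apply all bt"` dump.
Each thread is bucketed by the first recognizable blocking frame; the
marker strings below are assumptions from our backtrace, not exhaustive."""
import re
from collections import Counter

# Checked in order; first match wins for a given thread.
MARKERS = ["pthread_cond_wait", "pthread_mutex_lock", "accept", "ioctl"]

def classify(bt_dump: str) -> Counter:
    """Count threads per blocking site in a gdb 'thread apply all bt' dump."""
    counts = Counter()
    # gdb starts each backtrace with a line like: Thread 12 (LWP 1234):
    for thread_text in re.split(r"^Thread \d+ ", bt_dump, flags=re.M)[1:]:
        counts[next((m for m in MARKERS if m in thread_text), "other")] += 1
    return counts
```

Running this over our 288 KB dump is what produced the zero-threads-in-pthread_mutex_lock observation below.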
Key observation: zero threads in pthread_mutex_lock; every blocked thread is in pthread_cond_wait. The CaptureScheduler worker hit the fusa error path at fusaViHandler.cpp:869 and
failed to signal the condition variable that downstream threads are waiting on. This is a lost wakeup — the condition variable signal is never sent (or is consumed by the wrong waiter), so all
dependent threads wait forever. The libnvscf.so code addresses 0xffff9e28cc04, 0xffff9e2e5698, and 0xffff9e2e5c80 are the blocked sites.
On our earlier JP 6.2.1 captures the same bug manifested as an abandoned mutex (threads stuck in pthread_mutex_lock with no live holder). JP 6.2.2 appears to have fixed that specific path but left the
condvar-signal path in the same error handler broken.
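The suspected failure mode is easy to reproduce in miniature: an error path that returns before signaling leaves every waiter parked forever. A toy Python model of the pattern (all names are ours, nothing here is NVIDIA code; the real daemon has no timeout, which is why it blocks indefinitely):

```python
#!/usr/bin/env python3
"""Toy model of the suspected lost wakeup: a producer hits an error path
and returns without notifying, so the consumer's wait never completes.
All names are illustrative stand-ins, not libnvscf symbols."""
import threading

cond = threading.Condition()
frame_ready = False

def scheduler_worker(fail: bool):
    """Stand-in for the CaptureScheduler worker."""
    global frame_ready
    with cond:
        if fail:
            # Error path: bail out without waking anyone.
            # A correct handler would still call cond.notify_all() here.
            return
        frame_ready = True
        cond.notify_all()

def wait_for_frame(timeout: float) -> bool:
    """Stand-in for a downstream thread; False means the wakeup was lost."""
    with cond:
        return cond.wait_for(lambda: frame_ready, timeout=timeout)

# Healthy path: worker signals, waiter wakes.
threading.Thread(target=scheduler_worker, args=(False,)).start()
ok = wait_for_frame(timeout=2.0)        # True

# Error path: waiter times out here; with no timeout (as in the daemon)
# it would block forever -- exactly the wedge we observe.
frame_ready = False
threading.Thread(target=scheduler_worker, args=(True,)).start()
lost = wait_for_frame(timeout=0.5)      # False
```

The fix in such a pattern is simply to signal (or mark a terminal error state and broadcast) on every exit from the handler, including the InvalidState branch.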
Resource timeline (10-second samples over 3.5 h leading up to failure):
- RSS: stable ~1.26 GB → spiked to ~2.08 GB post-wedge (retained in-flight buffers, not a leak)
- Threads: 79 throughout (no thread leak)
- FDs: 132 throughout (no fd leak)
- fusa_errors: 0 for 3.5 h → jumped to 2 in a single sample at failure
- No gradual degradation — failure is instantaneous
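For reference, the timeline came from a 10-second /proc sampler. A minimal sketch of the equivalent (Linux-only; the real script also greps the journal for the fusa error count and appends a CSV row, omitted here):

```python
#!/usr/bin/env python3
"""Sample RSS, thread count, and fd count for a PID from /proc,
as used for our 10-second resource timeline (Linux only)."""
import os

def sample(pid: int) -> dict:
    """One timeline row for the given PID."""
    with open(f"/proc/{pid}/status") as f:
        status = f.read()
    rss_kb = next(int(line.split()[1]) for line in status.splitlines()
                  if line.startswith("VmRSS:"))
    return {
        "rss_kb": rss_kb,                                  # resident set size
        "threads": len(os.listdir(f"/proc/{pid}/task")),   # thread count
        "fds": len(os.listdir(f"/proc/{pid}/fd")),         # open fd count
    }

# Example against the current process; in the real script the PID comes
# from `pidof nvargus-daemon` and a row is written every 10 s.
row = sample(os.getpid())
```

Plotting rss_kb from these rows is what shows the flat ~1.26 GB line and the post-wedge step to ~2.08 GB.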
What we have available:
- Full thread apply all bt full output (288 KB, 79 threads)
- Per-thread kernel stacks (/proc/<pid>/task/*/stack)
- Per-thread /proc/<pid>/task/*/status
- /proc/<pid>/maps, /proc/<pid>/status, fd listing, environment
- journalctl -u nvargus-daemon full dump (last 2000 lines)
- 10-second CSV resource timeline (RSS, threads, fds, error counts)
- dmesg — clean, no kernel-side CSI/VI/host1x errors in any of our 6 runs
Happy to share the full forensic bundle. This is reproducible on demand — every run eventually wedges.
Ask: Can the libnvscf.so team review the error path at fusaViHandler.cpp:869 → CaptureScheduler for a missing pthread_cond_signal / NvOsSemaphoreSignal on the InvalidState branch?
The r36.5.0 binary still has this bug.