libnvscf.so CaptureScheduler deadlock persists on JP 6.2.2 / L4T r36.5.0 — gdb backtrace + resource timeline attached

We’re hitting a reproducible nvargus-daemon deadlock on JP 6.2.2 (L4T r36.5.0, R36 REVISION 5.0) with 7–8 concurrent ZED-X camera streams via nvarguscamerasrc on AGX Orin. The daemon
wedges without crashing: accept() still runs on the control socket and systemd sees a healthy unit, but every capture session returns UNAVAILABLE. A manual systemctl restart
nvargus-daemon is required to recover.

This is a follow-up to thread 361085 where @JerryChang confirmed fixes ship in JP 6.2.2. We upgraded, verified binaries (sha256 match against stock BSP), and the bug persists.

Environment (clean flash, no modifications to NVIDIA libraries):

  • JetPack 6.2.2 / L4T r36.5.0 (GCID 43688277)
  • Kernel: 5.15.185-rt-tegra (PREEMPT_RT)
  • AGX Orin, 7x ZED-X one s = 7 active streams
  • 1x Stereloabs quad capture card
  • ZED Link driver v1.4.1, ZED SDK v5.2.3
  • enableCamInfiniteTimeout=1 set in nvargus-daemon.service
  • libnvscf.so sha256: 944bf7830e342181c0e92361375753762889b643dbcaba6672d63cd002bfcd20

Trigger (from journalctl -u nvargus-daemon):
Module_id 30 Severity 2 : (fusa) Error: InvalidState Status syncpoint signaled but status value not updated in:/capture/src/fusaViHandler.cpp 869
Module_id 30 Severity 2 : (fusa) Error: InvalidState propagating from:/capture/src/fusaViHandler.cpp 811
Only these 2 lines, then silence. No crash, no recovery, daemon stops processing.
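Because the daemon still looks healthy to systemd, we now watch the journal for exactly this signature and restart on match. A minimal matcher, as a sketch (the signature strings are copied from the journal lines above; the function name is ours):

```c
#include <stdio.h>
#include <string.h>

/* Returns the source line number (e.g. 869) if `log` matches the fusa
 * InvalidState signature from fusaViHandler.cpp, or -1 otherwise.
 * Intended for a log-watching auto-restart watchdog. */
int fusa_invalid_state_line(const char *log)
{
    const char *sig  = "(fusa) Error: InvalidState";
    const char *file = "fusaViHandler.cpp";
    const char *p;
    int lineno = -1;

    if (!strstr(log, sig))
        return -1;                       /* not the fusa error we care about */
    p = strstr(log, file);
    if (!p || sscanf(p + strlen(file), " %d", &lineno) != 1)
        return -1;                       /* signature without a line number */
    return lineno;
}
```

Feeding it the two journal lines above yields 869 and 811 respectively; anything else returns -1.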

gdb backtrace of the wedged daemon (79 threads, attached to the PID via gdb -batch -ex "thread apply all bt full"):

  • Thread 1 (main): accept() on Argus socket — daemon appears alive
  • 12x SCF Execution threads: all in pthread_cond_wait — worker pool idle, no work dispatched
  • 2x CaptureScheduler threads: pthread_cond_wait via NvOsSemaphoreWaitTimeout in libnvscf.so at address 0xffff9e28cc04
  • 2x V4L2CaptureScheduler: same pthread_cond_wait pattern in libnvscf.so
  • 6 healthy sensor groups (~42 threads): ioctl() in V4L2 DQBUF — normal
  • 1 wedged sensor group (~7 threads): pthread_cond_wait in libnvscf.so — the failed channel (src5)
  • 4 client session handlers: pthread_cond_wait — blocked waiting for events

Key observation: Zero threads in pthread_mutex_lock. All blocked threads are in pthread_cond_wait. The CaptureScheduler worker hit the fusa error path at fusaViHandler.cpp:869 and
failed to signal the condvar that downstream threads are waiting on. This is a lost-wakeup — the condition variable signal is never sent (or consumed by the wrong waiter), so all
dependent threads wait forever. libnvscf.so code addresses 0xffff9e28cc04, 0xffff9e2e5698, 0xffff9e2e5c80 are the blocked sites.
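For clarity, here is the lost-wakeup shape we believe we are seeing, as a minimal standalone sketch (the names and structure are ours, not libnvscf.so internals):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static bool capture_done = false;

/* Downstream thread (the CaptureScheduler role): waits for the worker. */
void *scheduler_wait(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!capture_done)              /* correct predicate loop ...        */
        pthread_cond_wait(&cv, &lock); /* ... but sleeps forever if no one
                                          ever signals                      */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Capture worker: the suspected bug shape. On the error branch it bails
 * out without updating the predicate or signaling, so every waiter above
 * blocks indefinitely -- matching the all-pthread_cond_wait backtrace. */
void capture_worker(bool fusa_error)
{
    pthread_mutex_lock(&lock);
    if (fusa_error) {
        /* BUG: early return, no capture_done update, no pthread_cond_signal */
        pthread_mutex_unlock(&lock);
        return;
    }
    capture_done = true;
    pthread_cond_signal(&cv);          /* the signal the error path skips  */
    pthread_mutex_unlock(&lock);
}
```

On the success path the waiter wakes normally; calling capture_worker(true) instead reproduces the wedge in miniature, with the waiter parked in pthread_cond_wait and zero threads in pthread_mutex_lock.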

On our earlier JP 6.2.1 captures the same bug manifested as an abandoned mutex (threads stuck in pthread_mutex_lock with no live holder). JP 6.2.2 may have fixed that specific path but
left the condvar-signal path in the same error handler broken.

Resource timeline (10-second samples over 3.5 h leading up to failure):

  • RSS: stable ~1.26 GB → spiked to ~2.08 GB post-wedge (retained in-flight buffers, not a leak)
  • Threads: 79 throughout (no thread leak)
  • FDs: 132 throughout (no fd leak)
  • fusa_errors: 0 for 3.5 h → jumped to 2 in a single sample at failure
  • No gradual degradation — failure is instantaneous
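The timeline comes from a 10-second loop around essentially this sampling routine (a simplified sketch of our collector; the journal error-count parsing is omitted, and it assumes a Linux /proc):

```c
#include <stdio.h>
#include <string.h>

/* Read one resource sample (RSS in kB, thread count) for a pid from
 * /proc/<pid>/status. Returns 0 on success, -1 if the status file is
 * unreadable or the fields are missing. */
int sample_proc(int pid, long *rss_kb, long *threads)
{
    char path[64], key[64];
    long val;
    FILE *f;

    *rss_kb = *threads = -1;
    snprintf(path, sizeof path, "/proc/%d/status", pid);
    if (!(f = fopen(path, "r")))
        return -1;
    while (fscanf(f, "%63s", key) == 1) {
        if (strcmp(key, "VmRSS:") == 0 && fscanf(f, "%ld", &val) == 1)
            *rss_kb = val;               /* resident set size, kB */
        else if (strcmp(key, "Threads:") == 0 && fscanf(f, "%ld", &val) == 1)
            *threads = val;              /* live thread count */
        fscanf(f, "%*[^\n]");            /* skip the rest of the line */
    }
    fclose(f);
    return (*rss_kb > 0 && *threads > 0) ? 0 : -1;
}
```

The open-fd count is just a directory listing of /proc/&lt;pid&gt;/fd; we append one CSV row per 10 s sample.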

What we have available:

  • Full thread apply all bt full output (288 KB, 79 threads)
  • Per-thread kernel stacks (/proc/&lt;pid&gt;/task/*/stack)
  • Per-thread /proc/&lt;pid&gt;/task/*/status
  • /proc/&lt;pid&gt;/maps, /proc/&lt;pid&gt;/status, fd listing, environment
  • journalctl -u nvargus-daemon full dump (last 2000 lines)
  • 10-second CSV resource timeline (RSS, threads, fds, error counts)
  • dmesg — clean, no kernel-side CSI/VI/host1x errors in any of our 6 runs

Happy to share the full forensic bundle. This is reproducible on demand — every run eventually wedges.

Ask: Can the libnvscf.so team review the error path at fusaViHandler.cpp:869 → CaptureScheduler for a missing pthread_cond_signal / NvOsSemaphoreSignal on the InvalidState branch?
The r36.5.0 binary still has this bug.

hello aadiaadi12345,

let me confirm: is this related to the number of cameras? For instance, can you reproduce the same issue with a 6-cam setup?

Hey, I was able to reproduce the same error with a 6-camera test setup.

we would like to reproduce the same on a Jetson AGX Orin developer kit with the 6-cam reference board.
could you please share your test steps for reference?
for instance, did you reproduce the issue by running the 6-cam preview free running? or is there a background service running when the deadlock happens?

Our test setup was essentially 6-cam preview free running: we opened 6 camera sessions streamed via gstreamer nvarguscamerasrc, one nvarguscamerasrc element per stream. The sessions are opened once when the pipeline starts and held for the lifetime of the process. Beyond that we only run passive watchers for nvargus-daemon, dmesg, etc. Let me know if you need any specific detail I might have missed, or if I misunderstood your request.
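Concretely, each stream's launch description looks roughly like this (a simplified sketch: our real caps and sink differ, and build_pipeline is just an illustrative helper around what we pass to gst_parse_launch()):

```c
#include <stdio.h>

/* Build the launch string for one camera stream (sketch). sensor_id
 * selects the channel on the quad capture card; we create 6-8 of these,
 * one nvarguscamerasrc per stream, and hold each pipeline for the life
 * of the process. */
int build_pipeline(char *buf, size_t len, int sensor_id)
{
    return snprintf(buf, len,
        "nvarguscamerasrc sensor-id=%d ! "
        "video/x-raw(memory:NVMM),format=NV12 ! "
        "fakesink",
        sensor_id);
}
```

The shell equivalent per sensor would be along the lines of: gst-launch-1.0 nvarguscamerasrc sensor-id=5 ! 'video/x-raw(memory:NVMM),format=NV12' ! fakesink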