"fence timeout on [ffffffc02a6cd600] after 1500ms" after about 17000 seconds

Hello nvidia,

I am still struggling with the imx264 sensors and their driver, but I am now able to run “endless” gstreamer pipelines on two jetson-tx1 based custom boards, each fitted with two imx264 sensors, i.e. two pipelines per board:

gst-launch-1.0 nvcamerasrc sensor-id=${SENSOR_ID} ! 'video/x-raw(memory:NVMM), width=(int)1936, height=(int)1080, format=(string)I420, framerate=(fraction)30/1' ! nvvidconv ! 'video/x-raw(memory:NVMM), format=(string)I420' ! nvjpegenc ! multifilesink max-files=5

This is with L4T 28.2.1.
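For completeness, here is roughly the wrapper I use to start both pipelines on a board. The sensor-id values 0 and 1 and the multifilesink location pattern are only illustrative; the real script differs in details:

#!/bin/sh
# start one "endless" pipeline per sensor, keeping only the last 5 JPEGs of each
for SENSOR_ID in 0 1; do
    gst-launch-1.0 nvcamerasrc sensor-id=${SENSOR_ID} \
        ! 'video/x-raw(memory:NVMM), width=(int)1936, height=(int)1080, format=(string)I420, framerate=(fraction)30/1' \
        ! nvvidconv ! 'video/x-raw(memory:NVMM), format=(string)I420' \
        ! nvjpegenc \
        ! multifilesink location="cam${SENSOR_ID}_%05d.jpg" max-files=5 &
done
wait    # both pipelines are expected to run forever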

Unfortunately, the pipelines are not really “endless”: after about 17000 seconds they are stopped by a kernel “fence timeout” problem. The pipelines had been started manually on each board shortly after a reboot.

Here are the log excerpts from the two development boards.

[17119.495914] fence timeout on [ffffffc02a6cd600] after 1500ms
[17119.495924] fence timeout on [ffffffc02a6cd700] after 1500ms
[17119.496170] name=[nvhost_sync:15], current value=205805 waiting value=205806
[17119.496269] ---- mlocks ----
[17119.496284] 10: locked by channel 4
[17119.496316]
[17119.496318] ---- syncpts ----
[17267.368932] fence timeout on [ffffffc0d8779d00] after 1500ms
[17267.383781] name=[nvhost_sync:73], current value=452125 waiting value=452126
[17267.384820] fence timeout on [ffffffc0ebb21500] after 1500ms
[17267.384847] name=[nvhost_sync:7], current value=452126 waiting value=452127
[17267.384867] ---- mlocks ----
[17267.384890]
[17267.384897] ---- syncpts ----

I restarted the test a second time on one board, again with a similar result:

[17216.871534] fence timeout on [ffffffc0c2d63600] after 1500ms
[17216.877434] name=[nvhost_sync:66], current value=446893 waiting value=446894
[17216.895306] ---- mlocks ----
[17216.895635] fence timeout on [ffffffc0e6c15600] after 1500ms
[17216.895839] 3: locked by channel 4
[17216.895853]
[17216.895858] ---- syncpts ----

Is that a known problem (with a known solution)?

hello phdm,

we may need more details; looking forward to your feedback on the questions below,
thanks

  1. May I have confirmation of your long-run criteria (i.e. how long the pipelines are expected to run)?
  2. Does this depend on launching a couple of cameras together? Could you reproduce it with a single camera running for around 4 hours? (see the sketch after this list)
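For example, something along these lines should be enough to tell whether a single camera also hits the fence timeout; the sensor-id value and the grep step are just a suggestion, not a specific NVIDIA tool:

# single-camera long run: launch one pipeline, wait ~4 hours, then check the kernel log
gst-launch-1.0 nvcamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM), width=(int)1936, height=(int)1080, format=(string)I420, framerate=(fraction)30/1' ! nvvidconv ! 'video/x-raw(memory:NVMM), format=(string)I420' ! nvjpegenc ! multifilesink max-files=5 &
sleep 14400                       # ~4 hours
dmesg | grep -i "fence timeout"   # any match means the single-camera case fails too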

Hi JerryChang,

I confirm that we need our pipelines to run endlessly.

I have restarted the test with a single camera and will let you know what happens.

Hi JerryChang,

Yes, with only one sensor, the hang also happens after about 17000 s.

Here are the first kernel messages after the startup phase of my pipeline:

[10858.788042] gk20a 57000000.gpu: gr_gk20a_load_falcon_bind_instblk: arbiter complete timeout
[17385.023071] fence timeout on [ffffffc026606f00] after 1500ms
[17385.029130] name=[nvhost_sync:11], current value=494601 waiting value=494602
[17385.031047] fence timeout on [ffffffc0241ae200] after 1500ms
[17385.031109] name=[nvhost_sync:12], current value=494601 waiting value=494602
[17385.031160] ---- mlocks ----
[17385.031206]
[17385.031224] ---- syncpts ----

I attach the full log of the messages that follow.

cam2-17000s.log (64.9 KB)

hello phdm,

FYI, we have confirmed that we cannot reproduce the fence timeout failure after 17000 seconds;
we verified this by setting up an environment with our camera reference board and running the same commands overnight.

Sharing some of the kernel messages for your reference.
I only saved the logs below (the run has already been stopped and switched to another test):

[ 3979.531021] extcon-gpio-states extcon:extcon@1: Cable state 1
[ 3979.537406] tegra-xudc-new 700d0000.xudc: exiting ELPG
[ 3979.544611] tegra-xudc-new 700d0000.xudc: exiting ELPG done
[ 3979.550472] tegra-xudc-new 700d0000.xudc: device mode on
[ 3979.695210] extcon-gpio-states extcon:extcon@1: Cable state 0
[ 3979.701294] tegra-xudc-new 700d0000.xudc: device mode off
[ 3979.706884] tegra-xudc-new 700d0000.xudc: entering ELPG
[ 3979.713192] tegra-xudc-new 700d0000.xudc: entering ELPG done
[61458.215991] extcon-gpio-states extcon:extcon@1: Cable state 1
[61458.222258] tegra-xudc-new 700d0000.xudc: exiting ELPG
[61458.238924] tegra-xudc-new 700d0000.xudc: exiting ELPG done
[61458.248969] tegra-xudc-new 700d0000.xudc: device mode on
[61458.779782] configfs-gadget gadget: high-speed config #1: c
[61458.785452] tegra-xudc-new 700d0000.xudc: ep 5 (type: 3, dir: in) enabled
[61458.792272] tegra-xudc-new 700d0000.xudc: ep 3 (type: 2, dir: in) enabled
[61458.799087] tegra-xudc-new 700d0000.xudc: ep 2 (type: 2, dir: out) enabled
[61458.806256] tegra-xudc-new 700d0000.xudc: ep 9 (type: 3, dir: in) enabled
[61458.813097] tegra-xudc-new 700d0000.xudc: ep 7 (type: 2, dir: in) enabled
[61458.819937] tegra-xudc-new 700d0000.xudc: ep 4 (type: 2, dir: out) enabled
[61458.827334] tegra-xudc-new 700d0000.xudc: ep 15 (type: 3, dir: in) enabled
[61458.834293] tegra-xudc-new 700d0000.xudc: ep 11 (type: 2, dir: in) enabled
[61458.843290] l4tbr0: port 1(usb0) entered forwarding state
[61458.844665] tegra-xudc-new 700d0000.xudc: ep 6 (type: 2, dir: out) enabled
[61458.856311] tegra-xudc-new 700d0000.xudc: setup request failed: -95

You may want to dig into the low-level sensor drivers to confirm long-run stability on your side.
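One simple way to narrow this down during a long run is to keep a kernel log watch next to the pipeline, so the first fence timeout can be correlated with what the sensor driver was doing at that moment. This assumes a util-linux dmesg that supports follow mode, and the log path is only an example:

# keep kernel messages on disk while the long-run test is going, and flag the first failure
dmesg -w | tee /tmp/longrun-dmesg.log | grep -i "fence timeout"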
thanks

Hi JerryChang,

We got confirmation from Lattice, which provides the Sub-LVDS to MIPI CSI-2 bridge that we use, that this is a “feature” of their free “IP block”. They have now provided us with a “license” file (also free) that should fix the problem. Tests are currently running.

Sorry for the noise here