Vi2 CSI_A syncpt timeout

Hello, we found a way to make this issue reproducible with stress-ng:

The main idea is to generate a high rate of memory copies under RT-priority scheduling in a high memory-usage condition.

In our case, a 300M allocation from stress-ng leaves us with 82 MB of free memory:

MemFree:           82372 kB
RAM 1732/1893MB (lfb 11x4MB) cpu [99%,97%,98%,98%]@2065 EMC 28%@924 AVP 0%@204 VDE 120 GR3D 0%@12 EDP limit 0

We also added the output of cyclictest above each test case:

./cyclictest -t1 -p 95 -n -i 10000 -l 10000 -D 100

100% reproducible test case RT prio 49 - 60 s: Syncpt Timeout

./stress-ng  --sched fifo --sched-prio 49 --vm-rw 8 --vm-rw-bytes 300M -t 60 --metrics-brief 
T: 0 (15660) P:95 I:10000 C:    960 Min:     19 Act:   42 Avg:  532 Max:   47583

50% reproducible test case RT prio 1 - 120 s: Syncpt Timeout

./stress-ng  --sched fifo --sched-prio 1 --vm-rw 8 --vm-rw-bytes 300M -t 120 --metrics-brief 
T: 0 (15430) P:95 I:10000 C:   9600 Min:     15 Act:   37 Avg:  511 Max:   45502

As you can see, the kernel generates abnormal latency on userland threads at the very least (up to 47 ms of freeze). So what are the maximum allowed values for the kernel-internal latency between the IRQ and the capture_done() / capture_start() calls?

not reproducible test case Normal prio - 120 s: image grabbing is working

./stress-ng --vm-rw 8 --vm-rw-bytes 300M -t 120 --metrics-brief 
T: 0 (15257) P:95 I:10000 C:    999 Min:      9 Act:   21 Avg:   30 Max:      57

In RT mode, it seems that kernel memory allocation is very slow, because monitoring processes like “perf record” (which allocate kernel buffers) are almost frozen. tegrastats, on the contrary, behaves normally.

From the “perf record” trace, we have the following events recorded.

stress-ng-vm-rw 11782 [000] 94320.256270: irq:irq_handler_entry: irq=97 name=host_syncpt
  host_syncpt_hp    48 [002] 94320.256325: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.221425617
  host_syncpt_hp    48 [002] 94320.256333: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.221425617
  host_syncpt_hp    48 [002] 94320.256343: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.221425617
 tegra-vi/captur  8696 [003] 94320.256376: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.221425617
 tegra-vi/captur  8696 [003] 94320.256383: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.221425617
 tegra-vi/captur  8696 [003] 94320.256387: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.221425617
 tegra-vi/captur  8696 [003] 94320.256396: tegravi:tegravi_capturewait_exit: port=1 idcsi=9 syncpt=11029907 error=0
 tegra-vi/captur  8696 [003] 94320.256411: tegravi:tegravi_capturestart_enter: port=1 idcsi=9 syncpt=11029907
 tegra-vi/captur  8696 [003] 94320.256420: tegravi:tegravi_capturestart_exit: port=1 idcsi=9 syncpt=11029908 error=0
 tegra-vi/captur  8696 [003] 94320.256426: tegravi:tegravi_capturewait_enter: port=1 idcsi=9 syncpt=11029908
 stress-ng-vm-rw 11782 [000] 94320.273029: irq:irq_handler_entry: irq=97 name=host_syncpt
  host_syncpt_hp    48 [002] 94320.273086: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.238184613
  host_syncpt_hp    48 [002] 94320.273094: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.238184613
  host_syncpt_hp    48 [002] 94320.273104: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.238184613
 tegra-vi/captur  8697 [001] 94320.273124: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.238184613
 tegra-vi/captur  8697 [001] 94320.273130: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.238184613
 tegra-vi/captur  8697 [001] 94320.273134: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.238184613
 tegra-vi/captur  8697 [001] 94320.273143: tegravi:tegravi_capturedone_exit: port=1 idcsi=9 syncpt=11029908 error=0
 tegra-vi/captur  8697 [001] 94320.273160: tegravi:tegravi_capturedone_enter: port=1 idcsi=9 syncpt=11029908
 stress-ng-vm-rw 11782 [000] 94320.289642: irq:irq_handler_entry: irq=97 name=host_syncpt
  host_syncpt_hp    48 [002] 94320.289692: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.254797278
  host_syncpt_hp    48 [002] 94320.289698: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.254797278
  host_syncpt_hp    48 [002] 94320.289705: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.254797278
 tegra-vi/captur  8696 [003] 94320.289723: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.254797278
 tegra-vi/captur  8696 [003] 94320.289727: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.254797278
 tegra-vi/captur  8696 [003] 94320.289731: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.254797278
 tegra-vi/captur  8696 [003] 94320.289738: tegravi:tegravi_capturewait_exit: port=1 idcsi=9 syncpt=11029909 error=0
 tegra-vi/captur  8696 [003] 94320.289752: tegravi:tegravi_capturestart_enter: port=1 idcsi=9 syncpt=11029909
 tegra-vi/captur  8696 [003] 94320.289761: tegravi:tegravi_capturestart_exit: port=1 idcsi=9 syncpt=11029910 error=0
 tegra-vi/captur  8696 [003] 94320.289767: tegravi:tegravi_capturewait_enter: port=1 idcsi=9 syncpt=11029910
 stress-ng-vm-rw 11782 [000] 94320.306448: irq:irq_handler_entry: irq=97 name=host_syncpt
  host_syncpt_hp    48 [002] 94320.306475: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.271602525
  host_syncpt_hp    48 [002] 94320.306480: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.271602525
  host_syncpt_hp    48 [002] 94320.306487: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.271602525
 tegra-vi/captur  8697 [001] 94320.306503: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.271602525
 tegra-vi/captur  8697 [001] 94320.306509: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.271602525
 tegra-vi/captur  8697 [001] 94320.306511: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.271602525
 tegra-vi/captur  8697 [001] 94320.306520: tegravi:tegravi_capturedone_exit: port=1 idcsi=9 syncpt=11029910 error=0
 tegra-vi/captur  8697 [001] 94320.306533: tegravi:tegravi_capturedone_enter: port=1 idcsi=9 syncpt=11029910
 stress-ng-vm-rw 11782 [000] 94320.323014: irq:irq_handler_entry: irq=97 name=host_syncpt
  host_syncpt_hp    48 [002] 94320.323069: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.288169023
  host_syncpt_hp    48 [002] 94320.323078: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.288169023
  host_syncpt_hp    48 [002] 94320.323089: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.288169023
 tegra-vi/captur  8696 [003] 94320.323133: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.288169023
 tegra-vi/captur  8696 [003] 94320.323141: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.288169023
 tegra-vi/captur  8696 [003] 94320.323146: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.288169023
 tegra-vi/captur  8696 [003] 94320.323157: tegravi:tegravi_capturewait_exit: port=1 idcsi=9 syncpt=11029911 error=0
============================== FROZEN CPU OR PERF RECORD =====================================
 tegra-vi/captur  8697 [001] 94320.505912: tegravi:tegravi_syncpt_process_wait_list_enter: idcsi=9 ts=94319.288169023
 tegra-vi/captur  8697 [001] 94320.505922: tegravi:tegravi_syncpt_process_wait_list_before_run_handlers: idcsi=9 ts=94319.288169023
 tegra-vi/captur  8697 [001] 94320.505925: tegravi:tegravi_syncpt_process_wait_list_exit: idcsi=9 ts=94319.288169023
 tegra-vi/captur  8697 [001] 94320.505934: tegravi:tegravi_capturedone_exit: port=1 idcsi=9 syncpt=11029911 error=-11

The perf tool doesn’t report any error:

[ perf record: Woken up 34 times to write data ]
[ perf record: Captured and wrote 8.889 MB /mnt/tmp/perf_record.2 (~388366 samples) ]

dmesg :

[94321.260945] vi vi.0: wait: CSI_A syncpt timeout, syncpt = 11029914, err = -11
[94323.184974] vi vi.0: wait: CSI_A syncpt timeout, syncpt = 11029916, err = -11
[94323.392966] vi vi.0: wait: CSI_A syncpt timeout, syncpt = 11029918, err = -11
[94323.601964] vi vi.0: wait: CSI_A syncpt timeout, syncpt = 11029920, err = -11

@Romary
We appreciate your finding. Could you attach the stress-ng binary here?

stress-ng is open source and easy to compile on Tegra Ubuntu R21.5:

https://github.com/ColinIanKing/stress-ng

git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
make -j4
./stress-ng  --sched fifo --sched-prio 49 --vm-rw 8 --vm-rw-bytes 300M -t 60 --metrics-brief

cyclictest is open source, comes from the RT patch world, and is easy to compile on Tegra Ubuntu R21.5:
https://rt.wiki.kernel.org/index.php/Cyclictest

git clone git://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git 
cd rt-tests
git checkout -b stable/v1.0 --track origin/stable/v1.0
make -j4
./cyclictest -t1 -p 95 -n -i 10000 -l 10000 -D 100

The main goal is to measure the impact of system load running at RT priority 49 / 1 / normal scheduling (with stress-ng) on a process scheduled at RT priority 95 (cyclictest, or the syncpt / VI capture kthreads).

On x86 there is also a high impact on the maximal latency reported by cyclictest when stress-ng runs at RT prio 49 or even 1, but all drivers keep running perfectly.
What is wrong with the Tegra VI driver architecture is that any high CPU / memory system load will generate syncpt timeouts.

Our application does lots of computing, not with a fixed load, and may produce short bursts of very high system load (like stress-ng), causing V4L frame grabbing to stop.

We have 4 questions:

  • What are the timing requirements on the syncpt / VI register accesses below (any datasheet pointer?), as a maximum allowed latency in milliseconds or microseconds? See posts #20 and #18.
  • Do the latest releases of the Tegra X2 and Tegra X1 kernel sources have the same VI driver architecture (IRQ => syncpt WQ => kthreads)?
  • Why is the syncpt workqueue job (the code that accesses registers) not integrated into the IRQ handler function?
  • Why are the VI capture-done / wait+start jobs (the code that accesses registers) not integrated into the IRQ handler function?

Without any RT patch, the only way to get deterministic behaviour from a driver under high system load is to keep the code inside the IRQ handler function.

@Romary

  1. I do not really understand your question. However, for the syncpt you can check TRM chapter 11, "Host subsystem".
  2. Yes, TX1/TX2/TK1 have the same design for the VI driver.
  3. A syncpt is like an interrupt. Please find detailed information in the TRM.
  4. Same as 3 above.

@Romary
I tried to reproduce by running cyclictest and stress-ng, and during that period using yavta to capture sensor data. Currently I can’t reproduce the syncpt timeout. How long does it take to reproduce on your side?

./cyclictest -t1 -p 95 -n -i 10000 -l 10000 -D 100
./stress-ng --sched fifo --sched-prio 49 --vm-rw 8 --vm-rw-bytes 300M -t 60 --metrics-brief
./yavta /dev/video0 -c20000 -n1

Indeed, we confirm this is not reproducible this way.
We are working on a yavta + stress-ng [+ other tools to be shared] based solution to make it reproducible on your side.

With our application, we confirm the issue is 100% reproducible.

We also discovered that some unused nv* libs for H264 encoding were still linked into our application, so we will clean those up and get back to you with an updated protocol.

Thanks Romary.
It would be helpful if we could reproduce it with our camera board.

When removing the nv libs (for H264) from our binary, we discovered that the USB webcam device driver will not start USB request transmission through tegra_udc unless the H264 engine has been initialized.

If the H264 engine is initialized beforehand, then USB requests are sent successfully.

Is there any known limitation of the Tegra UDC driver, or any link with the H264 engine?

This issue prevents us from sending you a yavta-based solution to reproduce the syncpt timeout issue.

Hi Romary, may we know why you removed some libs from your system? We only support/verify the whole system.

Here is the code that loads and initializes the Nvidia NVOMX library.

#include <dlfcn.h>
#include <stdio.h>

typedef struct _GOmxSymbolTable {
	int (*init) (void);
	int (*deinit) (void);
} GOmxSymbolTable;

...

void *handle = dlopen("libnvomx.so", RTLD_LAZY);
if (!handle) {
	fprintf(stderr, "dlopen failed: %s\n", dlerror());
	return;
}

GOmxSymbolTable s;
s.init = (int (*)(void)) dlsym(handle, "OMX_Init");
s.deinit = (int (*)(void)) dlsym(handle, "OMX_Deinit");

s.init();
s.deinit();
dlclose(handle);

Without these code lines, we are not able to start transmission of USB requests using the tegra_udc driver.
With this code, it just works.

Hi Romary, libnvomx.so is critical. Please load it.

For more information: USB cameras can output MJPEG, which requires JPEG decoding in libnvomx.so.

Hello,

The syncpt issue has gone away since we made some small changes to remove kernel realtime scheduling latencies and added a better realtime priority assignment between the threads of our userland application.

This leaves the issue unexplained.
We are now running long term endurance tests.

It is quite surprising to have to load libnvomx.so to enable a synchronous USB endpoint in device mode (USB gadget only). In our case we don’t have to encode/decode JPEG, since we are sending an already-encoded stream as raw MJPEG. It would be great to remove this dependency.

Romary
Did you get good results from the long-run test?

Hi everyone.

Currently I can connect the B102 with the J100 and get 720p60 data from it in RGB color space, but now I want to configure the B102 output for YV12 color space for a shorter capture time. When I tried to change the color space to YV12 in the driver, I only grabbed a still image, not video. Has anyone met this situation before? Please help me.

Hi haihoangsoftware,
This topic is for TK1. If your issue is not on TK1, you should start a new post for it.

I am sorry for posting it here.

My post is related to the TC358743, so I hope someone can help me solve the problem.