We have an application with three sensors running at 1920x1080p25, recording video to a USB drive.
We have found that we are periodically dropping frames, and it seems to be related to when the disk caches are flushed to the USB drive. To verify this, we implemented a check in the kernel (nvidia/drivers/video/tegra/host/vi/vi_notify.c) that measures the interval between consecutive frames on all channels and prints a kernel error message if it is not within tolerance (40 ms @ 25 Hz).
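A simpler way to correlate the drops with writeback activity, without touching the kernel, is to watch the dirty-page counters while recording (a sketch; the iteration count and 1-second interval are arbitrary, in practice you would let it run for the whole capture):

```shell
# Print the amount of dirty and in-flight writeback data once per second.
# A spike in "Writeback" lining up with a dropped-frame message supports
# the theory that flushing the page cache to USB is the trigger.
for i in 1 2 3; do
    date +%T
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 1
done
```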
If we run the following pipelines (recording only two cameras):
nvv4l2camerasrc device=/dev/video0 ! queue ! tee name=t
t. ! nvvidconv ! queue ! nvv4l2h265enc bitrate=20000000 ! h265parse !
nvv4l2camerasrc device=/dev/video2 ! queue ! tee name=t2
t2. ! nvvidconv ! queue ! nvv4l2h265enc bitrate=20000000 ! h265parse !
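For reference, the sink ends of the pipelines are truncated above. A complete single-camera variant would look roughly like this; the muxer, output path, and `sync=false` are my assumptions, not the exact command used:

```shell
# Hypothetical completion of the pipeline above. The matroskamux element,
# the output location, and sync=false are illustrative assumptions only.
gst-launch-1.0 -e \
  nvv4l2camerasrc device=/dev/video0 ! queue ! tee name=t \
  t. ! nvvidconv ! queue ! nvv4l2h265enc bitrate=20000000 ! h265parse ! \
       matroskamux ! filesink location=/media/usb/cam0.mkv sync=false
```

Swapping the `filesink` for `fakesink` gives the no-write comparison case mentioned below.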
We get periodic errors:
[ 1038.222653] Channel: 0 time between frames: 1720074368
[ 1043.053603] Channel: 1 time between frames: 840029088
[ 1043.062975] Channel: 0 time between frames: 840036224
[ 1103.855715] Channel: 1 time between frames: 480016640
[ 1103.865545] Channel: 0 time between frames: 440018912
[ 1174.948592] Channel: 0 time between frames: 2680115904
[ 1174.978102] Channel: 1 time between frames: 2720094112
[ 1208.499351] Channel: 1 time between frames: 1320045664
[ 1208.510067] Channel: 0 time between frames: 1360058720
[ 1243.380474] Channel: 1 time between frames: 2720094016
[ 1243.391541] Channel: 0 time between frames: 2680115840
[ 1247.620672] Channel: 1 time between frames: 240008320
[ 1247.631734] Channel: 0 time between frames: 240010336
[ 1277.113038] Channel: 0 time between frames: 1320056864
[ 1277.141689] Channel: 1 time between frames: 1400048448
[ 1311.982836] Channel: 1 time between frames: 2720093952
[ 1311.994484] Channel: 0 time between frames: 2680115712
[ 1346.984044] Channel: 1 time between frames: 2800096736
[ 1346.996066] Channel: 0 time between frames: 2800120960
[ 1380.505277] Channel: 1 time between frames: 1360047072
[ 1380.517486] Channel: 0 time between frames: 1320056992
[ 1452.200604] Channel: 0 time between frames: 3320143488
Here, the printed time should be around 40 ms (40 000 000 ns), so on the last line, for example, 3320143488 / 40000000 ≈ 83 frame intervals elapsed, i.e. roughly 82 frames were dropped!
Running the same pipeline but replacing filesink with fakesink results in no errors, which leads me to suspect that the I/O scheduling somehow blocks the VI input.
Any ideas on how to troubleshoot or remedy this issue?
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=tegra-xusb/3p, 5000M
|__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=usb-storage, 5000M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=tegra-xusb/4p, 480M
That verifies it is running USB 3.1 Gen 1 (5 Gb/s), which should be fast enough (as far as USB goes). I don’t know if there is a way to increase the buffer size for that USB storage device, but if there is, it might solve the problem. Someone from NVIDIA would have to answer what method (if any) could be used to increase the USB buffer size.
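There is no single “USB buffer size” knob that I know of, but the block-layer queue limits for the drive can at least be inspected, and raised within hardware caps. A sketch, assuming the drive enumerates as an ordinary block device (on the Jetson it is typically `sda`; here the first entry under `/sys/block` is used as a stand-in):

```shell
# Pick the first block device as a stand-in; substitute the USB drive's
# real name (often sda) on the target system.
DEV=$(ls /sys/block | head -n1)
cat "/sys/block/$DEV/queue/max_sectors_kb"   # largest single request (KiB)
cat "/sys/block/$DEV/queue/nr_requests"      # request-queue depth
# Raising the request size (capped by max_hw_sectors_kb) needs root:
# echo 1024 > "/sys/block/$DEV/queue/max_sectors_kb"
```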
This makes the write policy more aggressive, meaning that when the actual write happens the buffer is not as large; however, this only resulted in frame drops occurring more often…
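Writeback aggressiveness is typically tuned through the `vm.dirty_*` sysctls; a sketch of this kind of tuning, assuming those were the knobs used (the byte values are illustrative choices, not the poster’s actual settings):

```shell
# Show the current writeback tuning (percent of RAM, and flush interval
# in centiseconds):
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_writeback_centisecs
# More aggressive: start background writeback at 16 MiB of dirty pages
# and block writers at 64 MiB, so each flush burst stays small (root):
# echo $((16 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
# echo $((64 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes
```

Byte-based thresholds are usually preferable to the ratio knobs on a device with lots of RAM, since a few percent of RAM can already be a very large flush burst.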
The main issue seems to be that the VI interrupt gets shadowed by other interrupts. We would like the VI to have the highest priority so that we never miss a frame; we have plenty of free memory, and the encoding pipeline is nowhere near its performance limit.
We are currently running L4T R32.5; has anything been improved in later versions?
It is likely that the only CPU core handling this is CPU0, since it is a hardware IRQ. That would mean most hardware is competing for a CPU0 time slice. Are you running with the max performance mode?
I don’t know if the USB write inherits from the VI priority or not. If it does, then the priority increase is good, but if not, then you might be getting a priority inversion. I do not know if there is a way to increase USB mass storage write priority, but if there is, then perhaps you could make the USB write slightly higher priority than VI.
If you have software-only programs which might be consuming time on CPU0, then you might be able to reduce CPU0 load by changing the affinity of the software-only processes to another CPU core.
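A sketch of how to see where the interrupts land and how to move a userspace process off CPU0 (the PID and IRQ number are placeholders, and not every IRQ can be re-homed on Tegra):

```shell
head -n1 /proc/interrupts                       # header: one column per CPU
grep -iE 'vi|xhci' /proc/interrupts || true     # camera (VI) and USB IRQs
# Pin a software-only process onto cores 1-3, leaving CPU0 for IRQ work
# (PID is a placeholder):
# taskset -cp 1-3 <pid>
# Some IRQs can themselves be moved, if the interrupt controller allows:
# echo 1 > /proc/irq/<irq>/smp_affinity_list
```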
I don’t know about other release improvements, but it is possible. There are usually performance improvements with new releases.
It should. However, perhaps there is insufficient buffer?
On the other hand, there is something called “IRQ starvation”. I don’t know if this is the case here, but if one core is handling all of the hardware drivers, e.g., for wired ethernet, any Wi-Fi, any PCIe device, any persistent memory device, and a few other things (including USB and one or more cameras), then it is possible that servicing the interrupts requires a rate in excess of what the core can provide. If one driver takes more time, there is less time available to service the other interrupts.

Some interrupts are higher priority than others (we have to avoid priority inversions and deadlocks). Some parts of drivers allow multitasking, but other parts are atomic: imagine something like eMMC being in the middle of a transfer; it is quite possible that there is no way to tell the eMMC to halt the transfer and pause while we switch to another task, and so that section of code has to be marked atomic.

If the interrupt servicing the arriving data is not handled before the next frame arrives (and generates a new interrupt), and the device itself does not have enough buffer for both frames, then you have to drop data. There are many creative ways for the OS to fail when only one core services the hardware IRQs.
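One rough way to quantify whether CPU0 really is absorbing almost all of the interrupt traffic is to sum the per-CPU columns of /proc/interrupts (a sketch; it counts raw IRQ deliveries, not time spent inside the handlers, so a few slow handlers can still dominate even with balanced counts):

```shell
# Sum each CPU's column across all numbered interrupt lines. If CPU0's
# total dwarfs the others, hardware IRQ load is concentrated on one core.
awk '/^ *[0-9]+:/ {
         for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total[i] += $i
     }
     END { for (i in total) printf "CPU%d: %d\n", i - 2, total[i] }' \
    /proc/interrupts
```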
Thanks for your thoughts. Yeah, it might be that the actual USB request to transfer the data is atomic, but the gaps between lost frames we are seeing are sometimes up to 0.5 s; that seems way too long for an OS to be doing nothing…
Any ideas on how to move forward, or who might know something about this?
I think that answering this would likely require NVIDIA profiling the hardware (which can take expensive debug hardware). There are things you can do to guess, e.g., increasing the buffer sizes of anything related to this, but an actual answer as to whether it is IRQ starvation or some sort of tunable priority issue pretty much requires profiling, or running in a debug situation with a high-end JTAG debugger (or DCELL for newer hardware).
The two things (in general) which could possibly help just by guessing would be (A) more buffer, or (B) increased priority of your application (decreasing its “nice” value).
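Concretely, option (B) can be tried with the standard priority tools; a sketch, where the PID is a placeholder for the recording process (the uncommented line only demonstrates the syntax, since raising nice needs no privilege; the commented lines are the ones that would actually help and need root, and the real-time class deserves care because a runaway RT task can starve the whole system):

```shell
PID=$$   # placeholder: substitute the PID of the recording process
renice -n 10 -p "$PID"        # syntax demo; a *negative* value (more CPU
                              # time) is what you want, and needs root:
# renice -n -10 -p "$PID"
# Optional, use with care -- a runaway RT task can hang the system:
# chrt -r -p 50 "$PID"        # SCHED_RR real-time CPU priority
# ionice -c 1 -p "$PID"       # real-time I/O class for the disk writer
```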