I’m replying as I read, so some points may repeat or appear out of order relative to an ideal reply…
When you say the “lsusb -tv” output does not differ, is this an exact match between conditions, whereby the speed listed at the end of each line (e.g., “5000M”, “12M”, and so on) does not change? Knowing that the root HUB and the tree itself have maintained their settings is useful information. It would tend to mean that device modes have not fallen back to lower speeds (you’d have to monitor for speed changes, since a fallback is more or less invisible without specifically checking; dmesg would of course note whenever a device re-enumerates at a different speed).
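If you want to verify that nothing re-enumerates, a simple approach is to snapshot the tree before and after reproducing the problem, and to watch the kernel log live (the file names here are just hypothetical scratch locations):
lsusb -tv > /tmp/usb_before.txt
lsusb -tv > /tmp/usb_after.txt
diff /tmp/usb_before.txt /tmp/usb_after.txt
dmesg --follow | grep -i usb
An empty diff means the tree and speeds held; the dmesg line prints USB messages as they arrive (older systems use “dmesg -w”).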
“lsusb -tv” should be considered the authority on whether two root HUBs are in use, or just one; it is also the authority on what each root HUB is servicing. If you have two devices at 5000M, and the root HUB is 10000M, then you have enough bandwidth at all times (latency might still differ between one and two 5000M devices, but not significantly compared to “normal” operation). Having more root HUBs does not guarantee that everything goes to the correct USB port; the correct wiring and device tree are required for everything you want on a given HUB to actually reach it. Again, “lsusb -tv” describes what actually exists at any given moment. So far it sounds like there is no change at all in your “lsusb -tv”, which is important information.
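As a quick way to list just the root HUBs and their speeds (driver names in the output vary by platform):
lsusb -tv | grep -i root_hub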
Note that the older USB standards, USB 2 (“480M”) and slower, share a single controller (root HUB) for all of the legacy protocols up to and including USB 2. USB3+ root HUBs run independently, and for a USB3 device to fall back to USB2 or slower, the device must actually migrate from the USB3 root HUB to the legacy HUB. At that moment “lsusb -tv” will change. An example: you plug in something rated for USB3, but the cable has insufficient signal quality; the initial plug-in would show routing to the USB3 root HUB, and a moment later “lsusb -tv” would instead show routing to a legacy HUB. I’m basing all of this on there being no migration, which means signal quality is not part of the problem (power requirements could also cause such a migration, but this too is ruled out by the non-changing “lsusb -tv”).
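If you suspect such a fallback happened at some point, the kernel log records every enumeration together with the negotiated speed; the exact message wording varies by kernel version, but something like this usually catches it:
dmesg | grep -iE 'new .*speed usb device'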
Trivia: FYI, an Intel CPU on a desktop PC has what is known as an I/O APIC (Advanced Programmable Interrupt Controller). The I/O APIC controls which CPU core an interrupt is routed to, and tends to be programmed by the scheduler depending on policies. AMD does something different, but similar, for its desktop CPU interrupts. Only a small subset of a Jetson’s hardware interrupts can be migrated to different cores. If you run this command you can see the hardware interrupt counts, some statistics, and descriptions:
cat /proc/interrupts
The above is a lot of output; you could use “less /proc/interrupts” instead, but the numbers change each time you look, so “cat” is good for a quick look at a given instant, and “less” is good for studying a snapshot taken at one point in time.
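To focus on just the USB lines, and to see how fast those counters move, something like the following works (the grep pattern is a guess; match it against the driver names in the last column of your output):
watch -n 1 'grep -i usb /proc/interrupts'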
The scheduler does not necessarily migrate different processes to different cores just to balance CPU load. It is a misconception that splitting everything evenly among cores makes software execute faster. Desktop PC architectures, and the ARM Cortex-A architecture (which is what a Jetson uses), take advantage of a lot of caching. There are different levels of cache (costing more or less money and power consumption) at different points in the CPU’s memory access path. If one takes two independent programs and runs them on two different cores, then operation would in fact be faster by splitting them across cores; if, however, we are talking about processes or threads which share data, and splitting them across cores means each must access a different core’s cache and then update the other’s, there would be a lot of cache misses, and performance would suffer horribly. The scheduler tries to take this into account, but the scheduler might not understand what the user intends.
Jetsons lack the ability to migrate many hardware IRQs…the wiring simply doesn’t exist. In “/proc/interrupts” you will see a lot of drivers serviced only on CPU0. You could in fact use CPU affinity to try to move those drivers’ hardware to another core. The scheduler would see this and appear to accept it, but what would happen is that the work would simply be migrated back. You’d add CPU load and latency because the scheduler would try to put this on an invalid core, realize it cannot, and then migrate it back to CPU0 anyway.
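For reference, per-IRQ affinity is exposed under “/proc/irq”. The IRQ number 123 below is purely hypothetical; substitute a real number from the first column of “/proc/interrupts”:
cat /proc/irq/123/smp_affinity
echo 2 | sudo tee /proc/irq/123/smp_affinity
The first command shows the current allowed-CPU bitmask; the second requests CPU1 (mask 0x2), and on hardware lacking the routing wires it will either fail outright or simply have no real effect.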
However, there are cases where each core has access to a given piece of hardware, e.g., every core has its own timers. There are also cases where an entire group of hardware can migrate, but not its individual parts (the GPIO controller, I think, is set up this way; you can’t migrate a specific GPIO to a different core, but I think you can migrate a group of GPIOs, a GICv3 device on Orin).
Note that at the bottom of “/proc/interrupts” there is a “Rescheduling interrupts” row. These can come from trying to route an interrupt to an invalid core, but a lot more than that can cause them, e.g., one process might be lower priority while another holds the same core, so its execution is delayed. The most interesting row is “Err”, the error count, which should be zero.
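You can pull out just those summary rows with something like this (row labels can vary slightly between kernel versions):
grep -iE 'rescheduling|err' /proc/interrupts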
Setting your Jetson for max performance with nvpmodel is the most reliable and easiest way to improve latencies (at a slight cost in power consumption). This won’t change the default scheduling policy, but it will keep clocks higher and allow more power consumption and somewhat higher temperatures.
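For example (power model IDs differ between Jetson modules and L4T releases, so query first rather than assuming 0 is max performance):
sudo nvpmodel -q
sudo nvpmodel -m 0
sudo jetson_clocks
The “-q” shows the current model, “-m” switches models, and jetson_clocks additionally pins clocks at their maximums.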
For what follows you probably want to be a bit pickier about the terms “process” and “interrupt”. A process ID (PID) is a user space notion, whereas a hardware IRQ lives entirely in kernel space. Both are ways of identifying which software runs where, but one tends to speak of hardware drivers via their IRQ and user space software via its PID. Utilities which operate on a particular program operate on the PID; directly changing the priority of a driver is not the normal situation. However, if a given PID uses a particular piece of hardware which has a given driver on a given IRQ, then changing the scheduler priority of the PID can indirectly change the IRQ’s priority. The scheduler is not obligated to honor requested changes (if this were hard realtime hardware that would not be the case…ARM Cortex-R hardware is capable of running with guaranteed timings and never ever missing).
If you have a program which runs two cameras together, then increasing the priority of the program could (and often will) change the priority of the kernel driver whenever a hardware IRQ is issued. Both cameras may be on one cable, but technically they are separate devices unless the internals of the stereo camera include hardware to synchronize timing. It is good if the cameras or devices self-synchronize for stereo; if you tried to get the two cameras to synchronize by controlling the driver at the hardware IRQ level, you’d fail. You could tune this and improve it, but ultimately you couldn’t make it deterministic on this hardware (two separate cameras in stereo are inferior to one physical device with two cameras having internal shutter timing sync, since the latter gives deterministic operation of both cameras at the same instant in time even if the hardware IRQ timing splits into two IRQs with slightly different times).
When you are talking about a process via its PID, the scheduling “pressure” is the “nice” number. It is called nice because the higher the number, the nicer the process is about letting other processes push it out of the way. A nice value of zero is the default for user space programs. Anyone can set their own program to be “nicer”, but if you want a higher priority (a negative nice number), then you must use root authority. Is there a single program which runs both cameras? Perhaps it runs other things as well, which would make this work less well, but you could renice (that’s an actual program name) your program to something like a nice level of “-2”.
Beware that you should not simply give processes higher priority at random; there can be unintended consequences, e.g., priority inversions. See:
With those you could either start your program with a nice of “-2”, or migrate it to that higher priority while it runs. If you go to “-5” you’ll probably have unintended consequences. Note that if your program is already just one step less nice than another, then making its nice value even more negative isn’t going to help. Regardless of how much priority you give your program, if the CPU core doesn’t have time to do what it needs for both cameras, then more priority isn’t going to magically make the core faster.
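A minimal sketch of both approaches (“camera_app” and PID “12345” are hypothetical placeholders):
sudo nice -n -2 ./camera_app
sudo renice -n -2 -p 12345
The first starts the program at nice -2 (negative values require root); the second migrates an already running process by PID.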
For drivers, one can also take a program (and indirectly the driver serving it; we’re talking to the scheduler) and bind it to a specific CPU core. This is CPU “affinity”. You won’t get much help from this for USB itself, since I think the wiring needed to migrate the USB controller to another core doesn’t exist. One can mark a program or process with a cgroup, and then use various methods to assign that cgroup to a specific core (which will promptly reschedule to CPU0, wasting time, when no wiring exists to reach your favored core). However, the way CPU affinity could help is that there are likely a number of software-only processes (things not requiring wires to hardware run on software IRQs…“soft” IRQs), and soft IRQs can migrate to any core at any time (once again, there may be performance penalties from cache misses in doing this). This means you could find software processes on CPU0 and offload them to another core, leaving hardware IRQs a less loaded CPU0. I don’t think this will be of huge benefit to you though, due to the bandwidth USB3 stereo pushes.
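For reference, the simplest user space tool for affinity is taskset (the PID and core numbers here are hypothetical):
taskset -pc 12345
taskset -pc 2,3 12345
taskset -c 2 ./some_program
The first line shows which cores PID 12345 may run on, the second restricts it to cores 2 and 3, and the third launches a program pinned to core 2.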
When building hardware drivers it is considered efficient if the hardware IRQ does the minimum possible work (taking the least time), and then completes any remaining work by issuing a software IRQ handled by a different driver. This means the “wired” connection locks the core for less time. The software IRQ might even run on the same core in order to take advantage of cache hits, but the split still gives the system an opportunity to preempt and run more smoothly (parts of your hardware IRQ will be atomic, so preemption might be denied for a short time; those times add up).
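You can watch the per-core soft IRQ side of that split (rows such as TIMER, NET_RX, and TASKLET) with:
cat /proc/softirqs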
Regarding your specific “lsusb -tv”…
Your root HUB at bus 2, port 1, device 1, has 10000M of bandwidth. The first two items on that root HUB use 5000M each, which by itself accounts for the entire 10000M (2 × 5000M). That completely uses the bandwidth (many devices don’t transfer data 100% of the time, so it might not actually be that bad, but often it is). Then I see another HUB, itself 10000M, consuming from that original root HUB. If that HUB runs at the same time as both cameras, then you are guaranteed traffic congestion. Just because a HUB is in 10000M mode doesn’t mean it will actually use that much bandwidth, e.g., if you plug a mouse into that HUB, then its consumption is trivial.
However, if you look at that other 10000M HUB (bus 2, port 3, device 2) from Realtek, I am unable to tell what those devices actually are. Most of them are 5000M, so even if this non-root HUB had its own separate root HUB, that HUB all by itself has consumers exceeding its 10000M of bandwidth. And this 10000M is being consumed (in your case) on a root HUB that already carries 10000M from the USB Video Class cameras. There is no possibility that USB can provide enough bandwidth if all devices operate at the same time (some devices might buffer and burst, and sometimes work, but on average it will fail).
My conclusion is that most of your problems are due to running everything from a single root HUB which is not even remotely capable of handling all devices simultaneously. That particular “lsusb -tv” does not show the existence of a second USB3 root HUB.
Root HUBs will not automatically route to a given port. You have to (A) have the drivers present (you must have that, since one of them shows up), (B) have the drivers know to use that port (a device tree setting; this is the firmware telling drivers how to bind to hardware at given addresses), and (C) have the actual wiring exist. For example, if you routed only the legacy USB2 wiring, then no matter what you do, the USB3 root HUB won’t show up. If you did everything correctly on the schematic, but didn’t properly bind things together in the device tree, then the hardware will appear to be missing. If you have the wiring for a USB3 10000M root HUB which you do not see, then it is likely your device tree is incorrect.
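A quick sanity check on what the firmware actually handed to the kernel (node and driver names vary by Jetson model, so treat these patterns as guesses):
ls /proc/device-tree | grep -i usb
dmesg | grep -iE 'xhci|xusb'
The first shows USB-related device tree nodes; the second shows which USB3 host controllers actually probed at boot.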
If you could actually get a second USB3 root HUB, then you’d still be at the edge of your bandwidth, but you would stand a chance. Currently, giving drivers higher priority would not help, at least not enough to solve the problem. I suggest examining your USB3 wiring and device tree to get a second root HUB at USB3 speed to show up. Perhaps you truncated your “lsusb -tv”, but if not, then there is no chance this will succeed beyond the first two cameras unless you cut down resolution.