The max priority of -4 is mainly just from experience. If the system processes you care about have a lower priority than your -4, then your process will tend to run first (and under hard real time it will run first). You're not really supposed to be competing with system processes, but instead with other user space processes. Say for example that your process depends on disk access, but your process has a higher priority than the disk I/O servicing it; then all of a sudden you slow down due to a priority inversion (the prerequisite work waits until it is very late, e.g., a timeout, or until something unblocks it). Even if your process does not directly block the system in an outright priority inversion, it might do so through a library call which in turn gets blocked, or through some kthread which is a side effect of running your process at a higher priority. I've found that there is rarely a problem at -4, and you can change latencies and averages too much when you arbitrarily go to too high of a priority. Also, depending on the scheduler, it can adapt based on process pressure, so this is a bit of an art and not entirely a science. I also have not found that a priority beyond -4 actually helps much; if it isn't about competition, and is just a case of inefficient code, then you are just making the inefficient code run more often, which won't fix anything.
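As a minimal sketch of what I mean by -4 (this assumes you are adjusting the ordinary nice value of a SCHED_OTHER process, not an RT priority; a negative nice requires root or CAP_SYS_NICE):

```c
/* nice_demo.c: lower the nice value of the calling process to -4.
 * Assumes an ordinary SCHED_OTHER process; negative nice values
 * require root or CAP_SYS_NICE.
 * Build: gcc -o nice_demo nice_demo.c
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/resource.h>

int main(void)
{
    /* PRIO_PROCESS with who == 0 means "the calling process". */
    if (setpriority(PRIO_PROCESS, 0, -4) != 0) {
        fprintf(stderr, "setpriority failed: %s\n", strerror(errno));
        return 1;
    }
    errno = 0;
    int prio = getpriority(PRIO_PROCESS, 0);
    if (prio == -1 && errno != 0) {
        fprintf(stderr, "getpriority failed: %s\n", strerror(errno));
        return 1;
    }
    printf("current nice value: %d\n", prio);
    return 0;
}
```

You could do the same from the command line with "nice -n -4 ./your_program" or "renice -n -4 -p <pid>".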
Something I haven’t looked at is whether this networking path is polled or driven by a hardware interrupt, so don’t take the following details as a guarantee, but the idea behind them is very likely relevant and correct. kthreads typically poll on a timer, for some of them at 1000 Hz (some might instead be triggered by another process or thread; the scheduler has the final authority on what runs). Hardware itself tends to signal via a hardware IRQ (typically bound to CPU0; a core flooded with hard IRQs can even hit hardware IRQ starvation, which delays this further). Suppose you get a priority inversion involving networking: you could end up making networking time out, or wait longer, because of another process of higher priority.
Networking itself has some fascinating possibilities. Please tell me what network protocols are used for this, e.g., is it TCP, UDP, multicast, etc.? Networking tries to be efficient, or “nice”, to upstream and/or downstream nodes. In most cases you end up with some buffer holding a chunk of data, and the chunk is sent in a burst. This can be the actual payload, or it can be a fragment of a payload which the other end will try to keep in order and reassemble before passing it on to your user space program (or to most of kernel space). TCP tries to help with this, but I’m guessing that if this is on a local network it might be UDP. The gist of it is that if a buffer being sent does not reach a certain size, then the network send will wait (at some polling rate) in the hope of getting more data before sending, reducing fragmentation and the overhead of too many small packets. That is a delay. Packets which are too large for the frame are fragmented into subsets of the data, and the subsets which fill the buffer are sent immediately without waiting. The final buffer of a fragmented payload can still wait just like an individual too-small packet waiting on the timer: it would have been sent immediately had more data arrived to fill the buffer, but being the last buffer of the fragments it may well be smaller than the full buffer size.
Keep in mind that if you add something of higher priority, then that wait before small packets get sent can effectively grow longer.
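If it does turn out to be TCP, then the small-packet wait I described above is essentially Nagle’s algorithm, and you can at least experiment with turning it off on your socket. This is only a sketch, not a recommendation; whether it helps depends entirely on your data:

```c
/* nodelay_demo.c: disable Nagle's algorithm on a TCP socket so small
 * writes go out immediately instead of waiting to coalesce with later data.
 * Build: gcc -o nodelay_demo nodelay_demo.c
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int main(void)
{
    /* In your real code this would be your already-connected socket. */
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("socket");
        return 1;
    }
    int one = 1;
    /* TCP_NODELAY: send small segments right away rather than buffering. */
    if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) != 0) {
        perror("setsockopt(TCP_NODELAY)");
        close(sock);
        return 1;
    }
    printf("TCP_NODELAY set\n");
    close(sock);
    return 0;
}
```

The tradeoff is exactly the one described above: lower latency per small write, but more small packets on the wire.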
In TCP/IP there is a rather fascinating set of possibilities, although it depends a lot on your actual data. We would need to know far more about data size and network settings. The MTU (Maximum Transmission Unit) and MRU (Maximum Receive Unit) work together, but the MTU is the one which might cause a wait before a small amount of data is sent. There are lots of ways to look at network metrics, and unless we are talking about your specific data with your specific protocol it may not make sense to go into much detail, but consider checking MTU and queue length with “ip link show <optionally name the specific interface>”, or perhaps looking at the fragmentation counters with “netstat -s” before and after one of your tests (you’re interested in seeing how fragmentation changed during the test, since these are counts of fragmented packets).
ip link show (compare MTU minus overhead to your data send size).
netstat -s (look for increases in fragmentation after running your test for a few minutes when latency has gone up).
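If you want your program itself to know the MTU rather than reading it from “ip link show”, the ioctl below does the same query (the interface name “eth0” is only an example; substitute your actual interface):

```c
/* mtu_query.c: read the MTU of a named interface via SIOCGIFMTU.
 * The interface name "eth0" is only an example.
 * Build: gcc -o mtu_query mtu_query.c
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(void)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0); /* any socket works for this ioctl */
    if (fd < 0) {
        perror("socket");
        return 1;
    }
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFMTU, &ifr) != 0) {
        perror("ioctl(SIOCGIFMTU)");
        close(fd);
        return 1;
    }
    printf("MTU of %s: %d\n", ifr.ifr_name, ifr.ifr_mtu);
    close(fd);
    return 0;
}
```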
A perfect solution is when your data is exactly the size the buffer wants to be filled to, after accounting for any overhead. In turn, that size is better still if it matches what the next hop in the route accepts. Sometimes padding the end of the data with unused NULL bytes to reach that size, and dropping those NULL bytes at the other end, is faster than sending less data, since a partially filled buffer won’t trigger an immediate send (more data can produce less latency).
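A tiny sketch of the padding idea follows. The 1460-byte target is an assumption (a 1500-byte MTU minus 40 bytes of TCP/IPv4 overhead), and the 4-byte length header is my own illustration, not part of any standard; the receiver has to know to strip the padding somehow:

```c
/* pad_demo.c: pad a message up to a fixed payload size with trailing
 * NULL bytes so the send buffer is always "full".
 * Build: gcc -o pad_demo pad_demo.c
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl() */

/* 1460 assumes a 1500-byte MTU minus 40 bytes of TCP/IPv4 header overhead;
 * adjust for your own link and protocol. */
#define TARGET_PAYLOAD 1460

/* Frame a message into a fixed-size payload: 4-byte big-endian length,
 * the data itself, then NULL padding to fill the buffer. The receiver
 * reads the length and drops the padding.
 * Returns 0 on success, -1 if the message is too big for one payload. */
static int pad_message(const uint8_t *msg, uint32_t len,
                       uint8_t out[TARGET_PAYLOAD])
{
    if (len > TARGET_PAYLOAD - sizeof(uint32_t))
        return -1;                          /* would need splitting instead */
    uint32_t be_len = htonl(len);
    memcpy(out, &be_len, sizeof(be_len));   /* real length */
    memcpy(out + sizeof(be_len), msg, len); /* actual data */
    memset(out + sizeof(be_len) + len, 0,   /* NULL padding to fill */
           TARGET_PAYLOAD - sizeof(be_len) - len);
    return 0;
}

int main(void)
{
    const char *msg = "small message";
    uint8_t payload[TARGET_PAYLOAD];
    if (pad_message((const uint8_t *)msg, strlen(msg), payload) == 0)
        printf("padded %zu bytes of data into a %d byte payload\n",
               strlen(msg), TARGET_PAYLOAD);
    return 0;
}
```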
If you were to use some old text-based application to send and receive typed messages, e.g., old IRC or a text-based game from the 1990s, and your MTU were 512 or some multiple, then you’d have more latency than if you dropped your system’s MTU to 296 bytes (I’m looking for a size which includes 40 bytes of TCP/IP overhead; payloads tend to be powers of 2, and 256+40=296). Because the latency of data sent matters more in this case than average throughput, it is better to use the smaller MTU in the hope of filling the buffer before the timeout send. More packets implies less efficiency, but better latency.
There is also a case for sending larger frames of data. Fragmentation and reassembly have a lot of potential problems. Data fragments may arrive out of order, and some might need to wait for others before reassembly. A fragment might be lost, requiring either the entire payload or at least that fragment to be retransmitted, which is an enormous hit on latency and efficiency. If we are sending data in bursts of 20,000 bytes, then a 1500-byte MTU produces a lot of fragmentation and then reassembly at the other end. There will be lots of checksums involved in kernel space (some network hardware offloads this with no overhead to the Linux kernel, when the checksums are performed by the NIC itself). In that case one might be better off enabling jumbo frames (e.g., a 9000-byte MTU rather than 1500), so the burst goes out in a few packets instead of many. On the other hand, if the next hop in the route does not support jumbo frames, then you will just get the same fragmentation along the network route, and jumbo frames might not really help at all. When you are using a private network with communications from host to switch to host, and no intervening hardware, then you can control this and use jumbo frames at all ends (assuming the switch supports it). You’d still possibly add slight latency waiting to fill a jumbo-sized buffer, but it would be less than fragmenting several times and reassembling. Better yet, in this case, maybe use jumbo frames in combination with NULL byte padding to fill a frame so it is sent immediately, rather than waiting to possibly add more to the buffer for efficiency.
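Just to put numbers on the 20,000-byte example, here is a back-of-the-envelope fragment count at a standard MTU versus a 9000-byte jumbo MTU. This assumes plain IPv4 fragmentation with a 20-byte header and no options; real numbers also depend on your transport headers and NIC offloads, so treat it as an illustration only:

```c
/* frag_count.c: rough count of IPv4 fragments needed for a payload at a
 * given MTU. Assumes a 20-byte IPv4 header and no IP options; real numbers
 * also depend on L4 headers and NIC segmentation offload.
 * Build: gcc -o frag_count frag_count.c
 */
#include <stdio.h>

static unsigned fragments(unsigned payload, unsigned mtu)
{
    unsigned per_fragment = mtu - 20;          /* room left after IP header */
    per_fragment -= per_fragment % 8;          /* fragment offsets are in 8-byte units */
    return (payload + per_fragment - 1) / per_fragment;  /* round up */
}

int main(void)
{
    unsigned payload = 20000;
    printf("%u bytes at MTU 1500: %u fragments\n", payload, fragments(payload, 1500));
    printf("%u bytes at MTU 9000: %u fragments\n", payload, fragments(payload, 9000));
    return 0;
}
```

Roughly fourteen fragments at 1500 versus three at 9000; every one of those is a chance for loss, reordering, and extra checksumming.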
What is “best” depends on so many factors that it is hard to fine tune without exact networking details. Reducing the MTU will always reduce packet size, and may improve latency in some cases. The MRU is up to the other end, and if the MRU is exceeded, the packet might be discarded entirely. Most of the time a node uses the same MRU as its MTU. Consider, though, that even if you enable jumbo frames, if the other end has an MRU smaller than that, the jumbo frame is going to be fragmented before sending despite jumbo frames being enabled. MRU and MTU work together, and the minimum of the two is what actually gets used.
Between the hardware IRQs involved in networking and the soft IRQs which might be involved in things like checksums, reassembly, and fragmentation, it is really easy for another process with a higher priority to completely change how networking behaves, even if your particular process does not seem related. Priority among user space processes isn’t such a problem, but you have to be the root user to set a priority higher than 0 (a more negative “nice”, e.g., “-1”), and there is a reason for that: once you get into those higher priorities you are competing with kernel space and not just user space.
If a priority of “-4” does not improve things, then it is likely something else needs to be considered and that competition for resources is not the cause of the latency.
Regarding affinity, keep in mind that normally it is the scheduler which determines what runs, and when, on any given core. An RT kernel is not magic; what really changes is the scheduler algorithm. The rules for what to send where can be made more absolute with an RT scheduler, but someone has to actually tune that for it to matter. A normal scheduler will succumb more to “pressure” from a low priority process which has been delayed longer and longer, such that it eventually runs even though it has a lower priority. RT can give hard assurances, but only in software; it doesn’t change the hardware unless we are talking about an ARM Cortex-R core.
Your normal ARM Cortex-A core (or a desktop CPU from Intel or AMD) has cache, probably at multiple levels, and your scheduler is aware of this. Sometimes a process will have many threads, and if you have 8 or 12 or more cores, it might look like it would run faster by putting each thread on a different core, but this is rarely the case; it depends on the data. Every time you migrate from one core to the next you get a cache miss. Any time you stay on the same core you might (and in a lot of cases probably will) get a cache hit. Cache misses cost a lot of time. Typically the scheduler will try to run the threads of a process on a single core to get cache hits. If you know this is not a problem, then putting a process or thread on another core might help. That’s only true, though, if the scheduler is not forced for some reason to migrate it back to the original core.
CPU0 is a special core on Jetsons. This core has the wiring for the IRQs of most hardware (a hardware IRQ needs wiring to a core in order to interrupt that core; desktop PCs have an IO-APIC or equivalent programmable interrupt controller to change where a hard IRQ routes). There is a lot of hardware which has no routing to any other core; you could set affinity to one of those other cores, but if a hard IRQ is needed, then the scheduler must migrate the work back to the original core.
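If you do experiment with affinity, the call below pins the current process to a single core other than CPU0 (CPU1 is just an example; whether it helps depends on whether the work ever needs a hard IRQ that only routes to CPU0):

```c
/* affinity_demo.c: restrict the calling process to CPU1 so it stays off
 * CPU0, which services most hardware IRQs on a Jetson. CPU1 is only an
 * example core number.
 * Build: gcc -o affinity_demo affinity_demo.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                 /* allow only CPU1 */
    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        fprintf(stderr, "sched_setaffinity failed: %s\n", strerror(errno));
        return 1;
    }
    printf("now restricted to CPU1\n");
    return 0;
}
```

From the command line, “taskset -c 1 ./your_program” does the same thing.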
The file “/proc/interrupts” is a list of current hardware IRQ statistics. It isn’t a real file; it lives in RAM and is updated in real time. You could look at it with something like “less /proc/interrupts” and see a snapshot. You will notice hard IRQs for timers on all cores; every core always has a timer, which is what allows polling on that core. However, you’ll also notice an overwhelming number of hard IRQs on CPU0. To some extent this is also true on a desktop PC, but a PC with an IO-APIC (or equivalent) has many more tricks up its sleeve than a Jetson when it comes to direct hardware interaction. The desktop PC is also trying to maximize use of cache, so it isn’t entirely different; it is the same scheduler.
If you’ve picked up the RT kernel, then you have a shiny new scheduler! If the hardware allows it, the RT scheduler lets you more or less guarantee certain events. On a Cortex-A core this is never a hard guarantee because of cache hits and misses, but it does add some control through tuned priorities (this is not automatic). The PC makes better use of the RT extension because, with an IO-APIC or equivalent, a process moved to another core won’t be forced back to the original core just because the new core lacks hardware IRQ routing.
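Tuning a process under the RT scheduler is likewise an explicit call rather than anything automatic. A minimal sketch follows; SCHED_FIFO at priority 10 is an arbitrary example value, it requires root or CAP_SYS_NICE, and on a non-RT kernel the call still works but with weaker guarantees:

```c
/* rt_demo.c: put the calling process under the SCHED_FIFO real-time policy
 * at priority 10 (an arbitrary example value). Requires root or
 * CAP_SYS_NICE. Be careful: a spinning SCHED_FIFO task can starve
 * everything else on its core.
 * Build: gcc -o rt_demo rt_demo.c
 */
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 10 };
    /* pid 0 means "the calling process" */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        fprintf(stderr, "sched_setscheduler failed: %s\n", strerror(errno));
        return 1;
    }
    printf("running under SCHED_FIFO, priority %d\n", sp.sched_priority);
    return 0;
}
```

The same experiment can be done without recompiling via “chrt -f 10 ./your_program”.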
It is indeed a possibility that the processes pulling data out of a kworker thread are the bottleneck, but I would definitely not assume a simple increase of your program’s priority is the fix; there is an enormous chain of priorities involved. Is the kworker thread your thread, or is it part of the system? Solving this can differ depending on the answer. Profiling can give you better answers, but in kernel space that is a far more difficult task than in user space. Looking at networking, do you have a way to tell whether the data being fed in is itself delayed by latency?