I can’t answer your question, but I’ll provide some information that you might find useful. This will get long.
First, keep in mind that UDP is designed to throw away packets if anything goes wrong. If packets arrive faster than they can be consumed, they are discarded. This is by design. If you can't tolerate that, then you have to go back to TCP (or build your own reliability layer on top of UDP).
Often it is said that TCP is inefficient compared to UDP. That's mostly due to packet and protocol overhead. It is possible to turn off the Nagle algorithm in TCP (Nagle delays small writes so they can be coalesced into fewer, larger segments; it is not the retransmit mechanism) and actually improve latency in some cases. Retransmission after a lost packet is a separate mechanism and will still stall the stream when things go bad. You need to give more detail about the nature of the data before much can be said which is constructive.
You could create an app with a large ring buffer, and use shared memory to give read access to another app. The ring buffer would just receive bytes and mark progress for beginning and end of buffer. The other app could read from beginning as far as possible, and advance the “begin” pointer. If the buffer gets filled, then it would just choose to either (A) wait, or (B) continue on and overwrite the oldest data while updating the pointer to start of data. I’ll say more below that will show why this might make sense.
The Linux kernel has a scheduler. That scheduler determines which process runs on which CPU core, and it makes those decisions whenever an interrupt occurs (including the periodic timer interrupt).
There are essentially two kinds of interrupts: hardware and software.
Hardware interrupts belong to hardware which has an actual interrupt wire leading toward the CPU core. An example would be an Ethernet card, or a USB root hub. When such a device has activity it asserts the hardware interrupt. The scheduler monitors this and determines when and where to send the work.
A software interrupt is independent of wires. Whereas a hardware interrupt requires a wire to the CPU core, a software interrupt is only a virtual logic event. This event is also monitored by the scheduler, and the scheduler still determines priority and where to send the work.
Pay very close attention to this: A hardware IRQ can only go where there are hard wires. Often (at least on a desktop PC with Intel CPUs) there is an Advanced Programmable Interrupt Controller (APIC) which can alter which core a hardware IRQ routes to. Jetsons tend to not have this, and some hardware must go to the first CPU core (CPU0).
One can set a process affinity to go to a particular core to override the scheduler, but if it is hardware IRQ to a core without the wiring, then it will just migrate back to the original core.
Take a look at this before you’ve run your Jetson for very long with networking:
cat /proc/interrupts
(this is for hardware IRQs)
Let your program run for a few seconds, and then run that same command again. You'll see the IRQ counts for CPU0 go up. If you were to use software affinity to try to bind this to another core, then it is likely you will see it still go to CPU0. The Ethernet depends on a hardware IRQ.
Now go to your command line, and run this command to see soft IRQs:
ps aux | grep 'ksoftirqd'
If this were a desktop PC with older cores and no hyperthreading, then you'd have one ksoftirqd per core. If you have hyperthreading, then this might end up as two ksoftirqd per core. Hardware interrupts are handled by the IRQ vector table in the kernel; software interrupts are handled via ksoftirqd. Both are managed by the scheduler.
If you have a kernel driver for hardware, and if that hardware has both I/O and some sort of software function, then best practice is to split the work. For example, if a network driver both receives data and computes a checksum, then good practice would be to service the hardware IRQ with a minimal receive routine and defer the checksum to a software IRQ (a "bottom half"). What this does is minimize the time a CPU core is locked to the hardware IRQ, while allowing the software IRQ to migrate to any other core.
The scheduler does not always migrate content even when it could. The scheduler has some knowledge of the cache, and of priorities of other processes and/or threads. If one has two programs using the same data, then switching cores would imply a cache miss; staying on one core would imply better performance due to cache. It isn’t always better to migrate to another core.
However, if CPU0 is doing a lot of hardware work, then it might be good to migrate to another core for some software interrupt. You can manually force migration with core affinity. The scheduler might ignore that request, but if it is a software IRQ, then it is likely the scheduler will do what you tell it to do (even if it breaks cache).
The reason it might be of interest to have one program which has that circular ring buffer, which is separate from the program which uses shared memory to access that buffer, is that the reception of the data is hardware IRQ dependent, but the program using the data could run on a different core. This would significantly complicate life though.
I’ll reiterate that a lot needs to be known about the data to answer any closer than this vague description. Can data frames be thrown away without issue? Is this all on a LAN, and thus not so likely to see packets out of order? What else is running on the core? How large is an individual data packet? Does one packet send more than one “line” of data? Does it take multiple packets to output one “line” of data? Are you using jumbo packets? Under ifconfig, what MTU shows up for that interface at both ends?
I’m not very good with the GPU, and definitely not useful with more advanced Python; I tend to go to C or C++. I can’t answer Python questions on the topic. There are, however, a few experts on the GPU here, and there are some GPU shared memory possibilities. Describe a lot more about the nature of the data.
As a final mention, Jetsons have various power modes. Make sure you are in the max performance mode (see nvpmodel and jetson_clocks). You are at a disadvantage if your system is not using all cores, or if the clocks are being throttled to save power.