The question of “sequential” is not about whether packets are being sent sequentially. Networking itself can cause them to arrive out of order. On a LAN this is less likely to occur. The more hops in the route or the more congested a route is (which implies any other traffic, not just yours), the more likely something will arrive out of order. This is one reason people might sometimes pick TCP.
However, it leads to the question: Can you tolerate lost packets? What happens if two packets are indeed out of order?
What size is the data? Is it always the same size? Is the data “small” relative to the packet size? Would you be better off accumulating a larger packet, or would you prefer a lot of smaller packets (with perhaps lower latency)? What kind of tolerance is there to latency? The MTU question still applies regardless of using UDP or TCP. It would be nice to know more about the data and consequences of imperfect networking. Even knowing you are on a LAN that is dedicated to this one task helps (there is limited bandwidth, but average bandwidth is not the only question).
Keep in mind that there is network overhead. I’m assuming IPv4. UDP has a header overhead of 8 bytes (4 fields of 2 bytes), on top of the 20-byte IPv4 header. Double the amount of data in one packet and your per-byte overhead is cut in half. Send two packets when one would do, and the data also has to be reassembled on the other end, which takes CPU power.
Normally, if the data plus header exactly fills a packet, then the packet is sent immediately. If the data is too large, then it is fragmented and sent in multiple packets. If the data is too small, then TCP may hold it and wait for more data in the hope of sending one full packet; if a timer expires before the packet fills, then the packet is sent anyway after that delay (UDP does not coalesce this way, since each datagram goes out as its own packet). I think the max MTU of the Jetson hardware/software is 9000 bytes (or close to that, including header overhead; both sides of the connection, and intervening route hops, may also limit this).
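If you want to see what MTU is actually in effect, and experiment with a larger one, the usual commands look something like the following; `eth0` is just an assumed interface name, so substitute whatever your `ifconfig`/`ip` output shows:

```
# Show the current MTU (eth0 is an assumed interface name; substitute yours)
ip link show eth0

# Experiment with jumbo frames, if the NIC, driver, switch, and far end all allow it
sudo ip link set dev eth0 mtu 9000
```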
Overall I think the “average” bandwidth, if you only receive that data, is within limits. But is all of it actually being consumed out of the (fairly small) receive buffer before that buffer overruns?
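One way to check is to look at the kernel’s own UDP counters and socket buffer limits. This is just a sketch of where to look (the sysctl value is an arbitrary example), not a claim that this is the cause:

```
# UDP statistics; watch for "packet receive errors" and "receive buffer errors"
netstat -su

# Current default and maximum socket receive buffer sizes (in bytes)
sysctl net.core.rmem_default net.core.rmem_max

# Temporarily raise the maximum (example value only; tune for your case)
sudo sysctl -w net.core.rmem_max=8388608
```

Keep in mind the receiving program also has to request a larger buffer via `SO_RCVBUF` (or you can raise `rmem_default`) for a higher maximum to actually be used.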
Incidentally, if you run `ifconfig` after some networking has run, take a close look at the “RX” (receive) side statistics. Do you see any drops or overruns? If not, then you are receiving everything (also check the sending end’s TX for the same). Note that all of that traffic is handled by CPU0; DMA only helps with part of this. If CPU0 is loaded by other processes, then dropped data can go up.
Alternate: if `ifconfig` is deprecated on your release, then you can run `ip -s -a addr`.
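For reference, the per-interface counters from the newer tool look like this (again, `eth0` is an assumption):

```
# The RX line shows errors, dropped, and overrun counts for the interface
ip -s link show eth0
```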
Every packet will generate a hardware IRQ. When the CPU cannot respond in time due to being busy with other IRQs (and with UDP the network does not wait for the Jetson the way it would with TCP), then you have “IRQ starvation”. I don’t know if the IRQ rate is saturated or not, but in theory the networking alone is generating 32000 IRQs each second, plus the eMMC, plus any external storage, plus other internal devices, and so on. When you are stuck on the first core (unlike a desktop PC, I am assuming there is no wiring for hardware IRQs to reach other cores), IRQ starvation starts to matter, especially with UDP, since the protocol will gladly drop data if the CPU core is not ready. DMA may mitigate this, but it won’t eliminate the issue: the header and protocol handling still trigger an IRQ to service the data even if the transfer of the non-overhead portion to some buffer is via DMA (that certainly decreases the time the hardware IRQ is held by that packet, but you still need a hardware IRQ).
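If you want to see which core is actually servicing the Ethernet IRQ, and how fast its count climbs, `/proc/interrupts` has one column per CPU; the grep pattern here is only a guess at the driver/interface name, so adjust as needed:

```
# Per-CPU interrupt counts; the Ethernet controller's row shows which core handles it
grep -i eth /proc/interrupts

# Watch the counters climb in near real time
watch -n 1 'grep -i eth /proc/interrupts'
```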
I can’t say much on the GPU, but do realize this is using shared memory with the CPU (this is not a discrete GPU with its own memory). Maybe @dusty_nv or one of the other people who program CUDA could answer what the most efficient way is to share data with the GPU. It could be that the GPU is not consuming the data fast enough. There might be merit in copying a lot of network data to a buffer, and only then sending it to the GPU (don’t know). Perhaps the GPU itself requires CPU0 to get that data.
What is it that made the decision to send 1106 bytes at a time? Sending enough to fill a 9000-byte MTU (keep in mind some of that is overhead) might reduce the load. It depends on what latency you can live with, but a dropped packet is perhaps more of an issue than a millisecond of latency. If the data is being generated at a high enough rate, then perhaps filling an MTU 9000 buffer has almost no penalty while greatly reducing the hardware IRQ rate on CPU0.
Your other programs could be told to live on another CPU core, e.g., CPU1 or CPU2. This is the topic of assigning cgroups and then setting affinity (a short `taskset` sketch follows the list):
- https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
- https://forums.developer.nvidia.com/t/irq-balancing/126244/6
- https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/cpusets.html
- https://www.kernel.org/doc/Documentation/kernel-per-CPU-kthreads.txt
- About `taskset`: https://wiki.linuxfoundation.org/realtime/documentation/howto/tools/cpu-partitioning/taskset
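As a minimal sketch of the `taskset` part (the program name and core numbers are placeholders, not something from your setup):

```
# Start a data-processing program pinned to cores 1 and 2, leaving CPU0 for IRQ work
taskset -c 1,2 ./my_consumer

# Or move a process that is already running (replace <PID> with the real process ID)
taskset -pc 1,2 <PID>
```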
You’d want to move processes which work with the data, but which are not tied to a hardware IRQ, to a core other than CPU0. You don’t necessarily want everything on separate cores, though, since performance depends on the cache (which can hit or miss), and forcing cache misses costs performance.
There is also the possibility of increasing the priority of a critical process. You have to be very careful about overdoing this since you can easily break the system. Let’s say you have a program which receives the data and puts it in a buffer. The default priority is a “nice level” or “niceness” of 0. Being “nicer”, i.e., having a higher nice level, means lowering the priority of that process. What you want is to “not” be nice and to hog the CPU. That means a negative niceness. The highest priority is a nice level of `-20`. You would destroy the system with that.
If you have a process on CPU0 which is reniced to -1, then other processes which are critical, e.g., the disk drive/eMMC driver, would hardly notice. You wouldn’t break storage. However, your other programs, which have a nice of 0, would not step on the process that fills the buffer (at least not as often). There are a lot of processes running that you don’t even see, and you’d have a slight advantage there.
Now let’s say you also have a program running on the GPU which pulls the data from that buffer. You might want to increase that program’s priority to a nice level of `-1`, and then bump the buffer accumulation program to a niceness of `-2` so it receives data before the GPU tries to process it. Now both programs have a slight advantage over other “average” (non-privileged) programs. You can’t really benefit from going far down that path because by the time you get to `-5` you are going to have some issues with the operating system. Also, if you already have a higher priority than the non-critical processes, then you’ve already done what you can.
Note that a nice level in the negatives requires root authority. See `man nice` (on a PC; the Jetson won’t have the `man` pages) and `man renice`. One can start a program with a different nice level, or bump it while it runs. In fact, you might want to let it run, observe performance, and then see what happens when you renice the running PID to `-1` or `-2`.
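As a concrete sketch (the program name is hypothetical), starting at a negative nice level and renicing a running PID look like this:

```
# Start the buffer-filling program at niceness -2 (negative values require root)
sudo nice -n -2 ./buffer_filler

# Or adjust priority after the fact while observing performance
sudo renice -n -1 -p <PID>
```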
Incidentally, if the process is on a different core than CPU0, then you have more freedom to renice. However, if you go too far negative (higher priority), then you probably want to make sure that process has affinity for some specific core and never migrates.