Receiving UDP packets

Hi!

I’m a new developer on the AGX Xavier, and I was asked to handle UDP packets arriving at a high rate, about 70,000 packets per second. I tried to handle these packets in Python. I created three processes: one receives the UDP packets and the other two manage other work. But it didn’t work well: the PC sends 70,000 packets per second, while the Xavier only receives about 15,000. I checked the Xavier’s working status with jtop and found only three CPU cores running at over 90% load while the other cores sit around 20%, which suggests I’m not fully utilizing the system. But I don’t know why this happens, since core scheduling is done by the system rather than by my program.
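For reference, my receiving process boils down to a plain socket loop like the sketch below (simplified; the enlarged SO_RCVBUF is an illustrative knob, not something I have tuned yet, and the loopback self-send is only to make the sketch runnable):

```python
import socket

def make_receiver(host="127.0.0.1", port=0, rcvbuf=4 * 1024 * 1024):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask the kernel for a large receive buffer; the default may be too
    # small to absorb bursts while Python catches up. The kernel may
    # cap or adjust the requested value.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    sock.bind((host, port))
    return sock

# Loopback demonstration: send a few datagrams to ourselves.
rx = make_receiver()
rx.settimeout(2.0)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for i in range(3):
    tx.sendto(b"packet-%d" % i, rx.getsockname())

received = [rx.recv(2048) for _ in range(3)]
tx.close()
rx.close()
print(received)
```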

Since the CPU may be what limits receive performance, I thought of using DMA to reduce the CPU workload, as I will be doing the computation on the GPU. So I want to store the UDP packets directly in GPU memory. I once read that the CPU and GPU share physical memory on the Xavier (sorry, I can’t offer the link as I forgot to keep it). So I want to set up DMA between the Xavier’s network port and that shared memory, but I’m confused about how to set the addresses of the network port and the shared memory. If there is a similar example you could provide, I’d appreciate it a lot. If I’ve got the wrong idea and there is no shared memory, then I would switch to GPUDirect RDMA, as I’ve found an example at https://github.com/NVIDIA/jetson-rdma-picoevb. Or maybe they are the same thing; I’m a little confused about these kinds of memory.

I was also told that receiving UDP from the terminal might be faster, but I don’t know how to run a receiving process in parallel with my Python computing project from the terminal.

I think I haven’t fully utilized the powerful Xavier, as I’m just a new developer. The Xavier can handle image or video processing tasks, which may be an even larger workload. If you have any other advice that would help me manage this task, please let me know.

I can’t answer your question, but I’ll provide some information that you might find useful. This will get long.

First, keep in mind that UDP is designed to throw away packets if anything goes wrong. If packets arrive too quickly, then they are discarded. This is by design. If you don’t want to do this, then you have to go back to TCP.

Often it is said that TCP is inefficient compared to UDP. That’s mostly due to packet and retransmission overhead. It is possible to turn off the Nagle algorithm in TCP (which coalesces small writes into fewer, larger packets at the cost of latency) and actually improve responsiveness in some cases. Note that retransmission after a lost packet is a separate TCP mechanism; that is what makes you wait when things go bad. You need to give more detail about the nature of the data before much can be said which is constructive.

You could create an app with a large ring buffer, and use shared memory to give read access to another app. The ring buffer would just receive bytes and mark progress for beginning and end of buffer. The other app could read from beginning as far as possible, and advance the “begin” pointer. If the buffer gets filled, then it would just choose to either (A) wait, or (B) continue on and overwrite the oldest data while updating the pointer to start of data. I’ll say more below that will show why this might make sense.
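If it helps, here is a minimal sketch of that ring buffer idea in Python (your language, not mine, so treat it as untested pseudocode; a real version would also keep the begin/end counters in shared memory and add locking so a second process can follow the writer’s progress):

```python
from multiprocessing import shared_memory

class RingBuffer:
    """Byte ring buffer over a shared memory block (single-process sketch).

    Overwrites the oldest data when full, as in option (B) above."""

    def __init__(self, size):
        self.size = size
        self.shm = shared_memory.SharedMemory(create=True, size=size)
        self.begin = 0  # oldest unread byte
        self.end = 0    # next write position
        self.count = 0  # bytes currently stored

    def write(self, data):
        for b in data:  # byte-at-a-time keeps the wraparound logic obvious
            self.shm.buf[self.end] = b
            self.end = (self.end + 1) % self.size
            if self.count == self.size:
                # Buffer full: advance "begin" and overwrite the oldest byte.
                self.begin = (self.begin + 1) % self.size
            else:
                self.count += 1

    def read(self, n):
        n = min(n, self.count)
        out = bytearray()
        for _ in range(n):
            out.append(self.shm.buf[self.begin])
            self.begin = (self.begin + 1) % self.size
            self.count -= 1
        return bytes(out)

    def close(self):
        self.shm.close()
        self.shm.unlink()

rb = RingBuffer(8)
rb.write(b"abcdef")
first = rb.read(4)
rb.write(b"ghijk")   # wraps around the end of the buffer
rest = rb.read(7)
rb.close()
print(first, rest)
```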

The Linux kernel has a scheduler. That scheduler is what determines which process to put on which CPU core. This is done whenever an interrupt occurs.

There are essentially two kinds of interrupts: Hardware, or software.

Hardware interrupts belong to hardware which has an actual interrupt wire leading to a CPU core. An example would be an ethernet card, or a USB root hub. When these devices have activity they assert the hardware interrupt. The scheduler monitors this and determines when and where to send the work.

A software interrupt is independent of wires. Whereas a hardware interrupt requires a wire to the CPU core, a software interrupt is only a virtual logic event. This event is also monitored by the scheduler, and the scheduler still determines priority and where to send the work.

Pay very close attention to this: A hardware IRQ can only go where there are hard wires. Often (at least on a desktop PC with Intel CPUs) there is an advanced programmable interrupt controller (APIC) which can alter which core a hardware IRQ routes to. Jetsons tend not to have this, and some hardware must go to the first CPU core (CPU0).

One can set a process’s affinity to a particular core to override the scheduler, but if a hardware IRQ is bound to a core without the wiring, it will just migrate back to the original core.

Take a look at this before you’ve run your Jetson for very long with networking:
cat /proc/interrupts
(this is for hardware IRQs)

Let your program run for a few seconds, and then run that same command again. You’ll see IRQs to CPU0 go up. If you were to use software affinity to try to bind this to another core, then it is likely you will see this still go to CPU0. The ethernet depends on a hardware IRQ.

Now go to your command line, and run this command to see soft IRQs:
ps aux | grep 'ksoftirqd'

If this were a desktop PC with older cores and no hyperthreading, then you’d have one ksoftirqd per core. If you have hyperthreading, then this might end up as two ksoftirqd per core. Hardware interrupts are handled by the IRQ vector table in the kernel; software interrupts are handled via ksoftirqd. Both are managed by the scheduler.

If you have a kernel driver for hardware, and if that hardware has both I/O and some sort of software function, then best practice is to normally split the work. For example, if you have a network card, and it is receiving data, plus it is performing a checksum, then good practice would be to have one driver for receiving data, and another driver for checksums. What this does is to allow the minimal amount of time locking a CPU core to a hardware IRQ, and then allowing the software IRQ to migrate to any other core.

The scheduler does not always migrate content even when it could. The scheduler has some knowledge of the cache, and of priorities of other processes and/or threads. If one has two programs using the same data, then switching cores would imply a cache miss; staying on one core would imply better performance due to cache. It isn’t always better to migrate to another core.

However, if CPU0 is doing a lot of hardware work, then it might be good to migrate to another core for some software interrupt. You can manually force migration with core affinity. The scheduler might ignore that request, but if it is a software IRQ, then it is likely the scheduler will do what you tell it to do (even if it breaks cache).

The reason it might be of interest to have one program which has that circular ring buffer, which is separate from the program which uses shared memory to access that buffer, is that the reception of the data is hardware IRQ dependent, but the program using the data could run on a different core. This would significantly complicate life though.

I’ll reiterate that a lot needs to be known about the data to answer any closer than this vague description. Can data frames be thrown away without issue? Is this all on a LAN, and thus not so likely to see packets out of order? What else is running on the core? How large is an individual data packet? Does one packet send more than one “line” of data? Does it take multiple packets to output one “line” of data? Are you using jumbo packets? Under ifconfig, what MTU shows up for that interface at both ends?

I’m not very good with GPU, and definitely not useful with more advanced Python, I tend to go to C or C++. I can’t answer Python questions on the topic. There are, however, a few experts on the GPU here, and there are some GPU shared memory possibilities. Describe a lot more about the nature of the data.

As a final mention, Jetsons have various power modes. Make sure you are in the max performance mode. You are at a disadvantage if your system is not using all cores, or if the clocks are being throttled to consume lower power.

Thank you for your timely answer; I appreciate the patient explanation.

As for the choice between TCP and UDP: this was decided previously by the team, and I only need to manage the receiving and subsequent processing based on this predetermined UDP transmission. Therefore we are trying to solve the problem with UDP first; if we cannot find a suitable solution, we will consider other transmission methods.

Your explanation of hardware and software IRQs in CPU core scheduling taught me why only three cores work efficiently; thank you for broadening my knowledge. However, I hope to make full use of the Xavier’s performance. Doing the core scheduling myself may make the problem more complicated, but it is a nice idea to try. Thank you for sharing.

As for your questions about the UDP packets: as you expected, the Xavier only receives data from another board directly through the network port, so the packets all arrive sequentially. We hope to keep up with the sending speed, which, after consultation with the other board’s developer, could be reduced to 32,000 packets per second by reducing the size of each data item and increasing the amount of data carried per packet. During the test the Xavier runs in nvpmodel MAXN (power mode 0), which is supposed to be the max performance mode. Only the Python editor (PyCharm) was open, and only the main program was running (that is, the three processes: one receives packets, one unpacks and passes the data to the GPU for calculation, and the last drives the control of other devices). The UDP payload of each packet is 1,106 bytes, and the MTU is the default 1,500 bytes. Is there too much data to handle? I had previously neglected to count the total amount of data to process: it is about 32,000 × 1,106 × 8 bits per second, roughly 283 Mb/s.

The question of “sequential” is not about whether packets are being sent sequentially. Networking itself can cause them to arrive out of order. On a LAN this is less likely to occur. The more hops in the route or the more congested a route is (which implies any other traffic, not just yours), the more likely something will arrive out of order. This is one reason people might sometimes pick TCP.

However, it leads to the question: Can you tolerate lost packets? What happens if two packets are indeed out of order?

What size is the data? Is it always the same size? Is the data “small” relative to the packet size? Would you be better off accumulating a larger packet, or would you prefer a lot of smaller packets (with perhaps lower latency)? What kind of tolerance is there to latency? The MTU question still applies regardless of using UDP or TCP. It would be nice to know more about the data and consequences of imperfect networking. Even knowing you are on a LAN that is dedicated to this one task helps (there is limited bandwidth, but average bandwidth is not the only question).

Keep in mind that there is network overhead. I’m assuming IPv4. UDP has an overhead of something like 8 bytes (4 fields of 2 bytes). Let’s say you double the size of data for one packet…your overhead is cut in half. Let’s say you send two packets when one would do…then the packets need to be reassembled as well, which takes CPU power.
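As rough arithmetic (assuming IPv4 with a 20-byte IP header on top of the 8-byte UDP header, and your 1,106-byte payload):

```python
IP_HDR = 20   # IPv4 header with no options
UDP_HDR = 8   # source port, dest port, length, checksum (4 x 2 bytes)

def overhead_fraction(payload):
    """Fraction of each on-wire IP datagram that is header overhead."""
    wire = payload + IP_HDR + UDP_HDR
    return (IP_HDR + UDP_HDR) / wire

small = overhead_fraction(1106)    # one data unit per packet
big = overhead_fraction(2 * 1106)  # two data units per packet

# Doubling the payload roughly halves the overhead fraction.
print(f"1106-byte payload: {small:.2%} overhead")
print(f"2212-byte payload: {big:.2%} overhead")
```

The bigger win, as discussed below, is that doubling the payload also halves the packet rate, and therefore the IRQ rate.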

Normally, if data is ready such that the data plus header exactly equals the packet size (including header overhead), then the packet sends immediately. If the data is too large, then the data is fragmented, and then sent in multiple packets. If the data is too small, then the packet might wait for more data before sending in the hopes of including that data in a full packet; if a timer is reached before the packet is full, then the packet is sent anyway after that timer delay. I think the max MTU of the Jetson hardware/software is 9000 bytes (or close to that, including overhead of headers; both sides of the connection, and intervening route hops, may also limit this).

Overall I think the “average” bandwidth, if you only receive that data, is within limits. But is it all actually being consumed out of a small buffer before it is overrun?

Incidentally, if you run ifconfig after some networking has run, take a close look at the “RX” (receive) side statistics. Do you see any dropped or overruns? If not, then you are receiving (also check the sending end’s TX for the same). Incidentally, all of that traffic is handled by CPU0; DMA only helps for part of this. If CPU0 is loaded from other processes, then the dropped data can go up.

Alternate: If ifconfig is deprecated, then you can run “ip -s -a addr”.

Every packet will generate a hardware IRQ. When the CPU cannot respond in time due to being busy with other IRQs (and with UDP the network does not wait for the Jetson), you get “IRQ starvation”. I don’t know whether the IRQ rate is saturated or not, but in theory the networking alone generates 32,000 IRQs each second, plus the eMMC, plus any external storage, plus other internal devices, and so on. When you are stuck on the first core (unlike a desktop PC, I am assuming there is no wiring for a hardware IRQ to other cores), IRQ starvation starts to matter, especially with UDP, since the protocol will gladly drop the data if the CPU core is not ready. DMA may mitigate this, but it won’t eliminate the issue, since the header and protocol trigger an IRQ to service the data even if the transfer of the non-overhead bytes to some buffer is via DMA (this would certainly decrease the time the hardware IRQ is held by that packet, but you would still need a hardware IRQ).

I can’t say much on the GPU, but do realize this is using shared memory with the CPU (this is not a discrete GPU with its own memory). Maybe @dusty_nv or one of the other people who program CUDA could answer what the most efficient way is to share data with the GPU. It could be that the GPU is not consuming the data fast enough. There might be merit in copying a lot of network data to a buffer, and only then sending it to the GPU (don’t know). Perhaps the GPU itself requires CPU0 to get that data.

What is it that made the decision to send 1106 bytes at a time? Sending enough to fill a 9000 MTU (keep in mind some is overhead) might reduce the load. It depends on what latency you can live with, but a dropped packet is perhaps more of an issue than a millisecond of latency. If the data is being generated at a high enough rate, then perhaps filling an MTU 9000 buffer has almost no penalty while greatly reducing the hardware IRQ rate on CPU0.

Your other programs could be told to live on another CPU core, e.g., CPU1 or CPU2. This is the topic of assigning cgroups and then setting affinity.

You’d want to move processes which work with the data, but which are not tied to a hardware IRQ, to a core other than CPU0. You don’t necessarily want everything on separate cores since it depends on cache (which can hit or miss) for performance and forcing a cache miss can cost performance.
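For illustration, affinity can also be requested from within a Python process (a sketch; `os.sched_setaffinity` is Linux-only, and as noted, the scheduler or IRQ wiring can override the request for hardware-IRQ-bound work):

```python
import os

def pin_to_core(core):
    """Try to pin the calling process to a single CPU core.

    Returns the resulting affinity set, or None when the platform
    has no affinity control (os.sched_setaffinity is Linux-only)."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    if core not in allowed:
        core = min(allowed)  # fall back to a permitted core
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)

# Example: ask for CPU1, leaving CPU0 to the hardware-IRQ-bound
# network driver (which stays on CPU0 regardless of affinity).
result = pin_to_core(1)
print("affinity now:", result)
```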


There is also a possibility of increasing priority of a critical process. You have to be very careful about overdoing this since you can easily break the system. Let’s say you have a program which receives the data and puts it in a buffer. The default priority is a “nice level” or “niceness” of 0. Being “nicer”, or having a higher nice level, means lowering the priority of that process. What you want is to “not” be nice and to hog the CPU. That means a negative niceness. The highest priority is a nice level of -20. You would destroy the system with that.

If you have a process on CPU0 which is reniced to -1, then other processes which are critical, e.g., the disk drive/eMMC driver, would hardly notice. You wouldn’t break storage. However, your other programs, which have a nice of 0, would not step on the process that fills the buffer (at least not as often). There are a lot of processes running that you don’t even see, and you’d have a slight advantage there.

Now let’s say you also have a program running on the GPU which pulls the data from that buffer. You might want to increase that program’s priority to a nice level of “-1”, and then bump the buffer accumulation program to a niceness of “-2” so it receives data before the GPU tries to process it. Now both programs have a slight advantage over other “average” (non-privileged) programs. You can’t really benefit from going far down that path because by the time you get to -5 you are going to have some issues with the operating system. Also, if you have a higher priority already over the non-critical processes, then you’ve already done what you can.

Note that a nice level in the negatives requires root authority. See “man nice” (on a PC, the Jetson won’t have the man pages) and “man renice”. One can start a program with a different nice level, or bump it while it runs. In fact you might want to let it run, observe performance, and then see what happens when you renice the running PID to -1 or -2.
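The same can be done from inside Python with `os.setpriority` (a sketch; raising niceness needs no privileges, while going negative, as discussed above, requires root, so this demonstration only goes upward):

```python
import os

# Current niceness of this process (0 is the default).
before = os.getpriority(os.PRIO_PROCESS, 0)

# Unprivileged processes can only become "nicer" (lower priority).
# Going negative, e.g. -1 or -2 as discussed above, requires root:
#   os.setpriority(os.PRIO_PROCESS, some_pid, -1)
target = min(before + 1, 19)  # 19 is the nicest (lowest) priority
os.setpriority(os.PRIO_PROCESS, 0, target)

after = os.getpriority(os.PRIO_PROCESS, 0)
print("niceness:", before, "->", after)
```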

Incidentally, if the process is on a different core than CPU0, then you have more freedom to renice. However, if you go too far negative (higher priority), then you probably want to make sure that process has affinity for some specific core and never migrates.

Sorry for not expressing the packet transmission setup clearly. The Xavier is expected to connect directly to another board, like a sampling board, through a network cable. There is a very small fixed interval between packets and no intermediate node, so the UDP packets should arrive in a constant order unless there is some sudden disturbance. So the hope is to receive without losing packets. And of course, the length and the type of the data are the same each time.

What I want to do with DMA is to receive packets from the port directly into a buffer. Once the buffer has received 3,200 packets (1/10 of 32,000), it would notify the CPU through a software IRQ or something similar to operate on the data. This is equivalent to removing the step of integrating data packets, to reduce the CPU burden. Can such an operation be implemented directly through a socket or anything else, since DMA programming seems a little complicated? Because the network port also has a cache area for received data, can we work directly on this cache to achieve the function described above, even if the cache is not on the GPU? Putting aside the plan to store the data on the GPU, which could be handled in another process with a little processing delay, our main goal now is to receive the massive number of packets without loss. Since the other board was developed by another colleague and is not based on the Xavier, I will first consult him about the MTU setting and see whether we can reduce the number of packets some more. But it may be difficult for them to change the MTU because of the way they designed the protocol.
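For now, without driver-level DMA, one thing I can try in plain Python is receiving into one preallocated batch buffer to avoid per-packet allocations (a sketch; 10 packets stand in for the 3,200, and the loopback sender is only to make it runnable):

```python
import socket

PKT_SIZE = 1106
BATCH = 10  # stands in for 3200 in this sketch

def fill_batch(sock, buf, pkt_size=PKT_SIZE, batch=BATCH):
    """Receive `batch` datagrams into the preallocated buffer `buf`
    and return the total number of bytes written."""
    view = memoryview(buf)
    total = 0
    for i in range(batch):
        # recv_into writes directly into the slice, no per-packet bytes
        # object is allocated.
        total += sock.recv_into(view[i * pkt_size:(i + 1) * pkt_size],
                                pkt_size)
    return total

# Loopback demonstration: send BATCH datagrams to ourselves.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(2.0)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for _ in range(BATCH):
    tx.sendto(bytes(PKT_SIZE), rx.getsockname())

buf = bytearray(BATCH * PKT_SIZE)
got = fill_batch(rx, buf)
tx.close()
rx.close()
print("bytes in batch:", got)
```

Once the buffer is full, the whole batch could be handed to the unpacking process in one go.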

About CPU core scheduling, we will also try to bind sockets on multiple cores to receive, and then restore the order when unpacking.
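For example, something like SO_REUSEPORT might let several receiver processes share one port (a rough sketch of the idea, an assumption on my part; note the kernel distributes by sender address and port, so all traffic from a single sending socket would land on one receiver):

```python
import select
import socket

def reuseport_socket(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # SO_REUSEPORT (Linux 3.9+) lets several sockets bind the same
    # port; the kernel picks a socket per flow (hash of sender
    # address and port).
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

rx1 = reuseport_socket(0)
port = rx1.getsockname()[1]
rx2 = reuseport_socket(port)  # second listener on the same port

# Two senders with distinct source ports, two datagrams each.
txs = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(2)]
for tx in txs:
    for _ in range(2):
        tx.sendto(b"x" * 64, ("127.0.0.1", port))

# Drain whichever listener the kernel chose for each flow.
received = 0
while received < 4:
    ready, _, _ = select.select([rx1, rx2], [], [], 2.0)
    if not ready:
        break
    for s in ready:
        s.recv(2048)
        received += 1

for s in [rx1, rx2] + txs:
    s.close()
print("received:", received)
```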

I really appreciate your prompt response and detailed explanation. If you have any idea how other developers have managed the receiving of video or images, which may be an even larger workload, please let me know; it may inspire other solutions. Thanks again for your hard work.

DMA is normally part of the network driver. Maybe someone from NVIDIA (@WayneWWW or @KevinFFF) could comment on whether the ethernet driver uses DMA by default. If not, what might be needed to use DMA from ethernet UDP to user space. I don’t know enough about the ethernet driver. Socket level programming does not have a DMA mechanism; you’d have to be working on the driver itself.

How much control do you have over the sending device? Is it possible to tell it to send jumbo frames? I ask because the performance you are looking for is not only about the data. As you mention, DMA can make the data copy itself cheap for the CPU (the ethernet driver may already do this). Consider though that even if each received packet costs little CPU, it still requires an IRQ. The rate of IRQs also matters; it isn’t just about how long the CPU is held for a transfer. If you have any control over how the sender’s network is tuned, then you could gain significant performance by using jumbo packets (this would delay maybe 5 packets and then send the accumulated data as a single packet in a 9000-byte frame; the application inside the device would neither know nor care about the accumulation prior to sending).

Assuming you have a program which does nothing more than accumulate 3200 packets and publish them via some mechanism in user space, I’m thinking that this may be the most you can do. If this were C/C++ I could tell you how to profile it and find out where time is being spent. Python no doubt has its own profiling mechanisms, though I’m far less familiar with them. Incidentally, the fact that you are accumulating 3200 packets prior to using them says your network would not even blink at accumulating 5 frames and sending them in one packet (the time is insignificant, and you are not releasing until the receiving end sees 3200 packets anyway). Imagine if the receiving end had to interrupt CPU0 only 1/5 as often.
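I’m told the Python standard library ships a profiler, cProfile; a minimal sketch (the unpack function here is just a stand-in for your real unpacking step):

```python
import cProfile
import io
import pstats

def unpack_batch(buf, pkt_size):
    """Stand-in for the real unpacking step: slice the batch
    into per-packet chunks."""
    return [buf[i:i + pkt_size] for i in range(0, len(buf), pkt_size)]

profiler = cProfile.Profile()
profiler.enable()
batch = bytes(3200 * 64)          # 3200 small stand-in "packets"
packets = unpack_batch(batch, 64)
profiler.disable()

# Print the 5 most expensive calls by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```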

I think if I were to write this I’d use C or C++, create a ring buffer with enough space for maybe a full second, and mark beginning and end address for 3200 packets, and then publish the data in that range with shared memory (read-only). On the other hand, you have not mentioned if the GPU is going to need direct access to this, and CUDA or similar would change how things should be organized since it has its own shared memory mechanism. You might also mention if this data gets copied directly to a GPU. I have no advice for doing this in Python.

You won’t succeed at binding the actual network driver to cores other than CPU0. Perhaps also for what the GPU does. For anything else in user space you can put that on its own dedicated core (the network driver works on CPU0, but sharing accumulated data with user space implies that after this your program won’t need to be on CPU0; CUDA and GPU usage could drastically change things).