The system becomes very slow while running networking tasks on a Jetson TX2

When I perform a git push on a Jetson TX2, or rsync a file from it to my other machine (rsync user@tx2:somefile, run locally),
the Jetson Linux system becomes very slow, and many retries are lost, causing the transfer to fail with an interruption.
This was in a public network environment with a transfer rate of about 3-5 MB/s at the time. I checked system resources such as CPU and memory, and iostat as well; all of them showed very low utilization. How can I determine what the problem is?

I won’t be able to give you a good answer, but this is a starting point…

Monitor “dmesg --follow” while doing this. See if error messages show up.

Is this Wi-Fi, or is it wired? You’ll find ifconfig output is good for both, and you should get a copy of that prior to starting your download/activity. Example with a log:
ifconfig 2>&1 | tee log_ifconfig_start.txt
(you can append " 2>&1 | tee log_name.txt" to any command line to log it; or you could do this from a serial console while logging)

For Wi-Fi, these:
rfkill 2>&1 | tee rfkill_start.txt
iwconfig 2>&1 | tee iwconfig_start.txt

Note that if you name a wireless interface, then you can use iwlist to see details about a particular interface. Check “iwlist --help”. Then, as an example:
iwlist wlan0 channel 2>&1 | tee wlan0_channel.txt
(useful if you want to know about some particular detail)

If you have a description of these items before you start, and again after a problem occurs, that is useful. Notice that some of the commands, especially ifconfig, list a number of statistics (or, on newer systems, “ip -s addr”…the “-s” is for statistics). The statistics on errors, packet counts, drops, overruns, framing errors, collisions, and so on are very useful for distinguishing configuration errors from actual signal quality issues.
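As a sketch of comparing those counters before and after a transfer, the same per-interface statistics are also exposed under sysfs. The interface name here (lo) is only a placeholder; substitute your real device (e.g. eth0 or wlan0):

```shell
# Snapshot error/drop counters for one interface from sysfs.
# IFACE is an assumption; substitute your actual device (eth0, wlan0, ...).
IFACE=lo
for f in rx_errors tx_errors rx_dropped tx_dropped collisions; do
  printf '%-12s %s\n' "$f" "$(cat /sys/class/net/$IFACE/statistics/$f)"
done
```

Run this once before the transfer (tee it to a “_start” log) and once after, then diff the two logs to see which counters grew.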

Thanks a lot for your guidance, and sorry for not responding sooner; I had to divert my attention for two days to other, higher priorities.
When I said earlier that “the Jetson Linux system becomes very slow”, I meant that CPU scheduling is very slow: shell terminals and many background processes cannot be scheduled in time. The network transfer rate appears normal, but the transfer often terminates abnormally because the processes are not scheduled.

nvpmodel is 0, MAXN

Scheduling is triggered whenever there is either a software IRQ or a hardware IRQ. Software IRQs can be distributed to any CPU core, but hardware IRQs can run only on cores with actual wiring going to them. CPU0 is the only core capable of most hardware IRQ servicing. You can see which hardware IRQ runs on which core at any given instant via “cat /proc/interrupts”.
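For example, to see how hardware IRQs are distributed across cores (the “eth” pattern is an assumption; match whatever your NIC is called in the full listing):

```shell
# First line names the CPU columns; each following line is one hardware IRQ
# with its per-core service counts.
head -1 /proc/interrupts
# Narrow to likely network-controller lines; adjust the pattern to your NIC.
grep -i eth /proc/interrupts || true
```

On a TX2 you would expect most of the counts for network-related IRQs to sit in the CPU0 column.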

One can use CPU core affinity to migrate an IRQ (which essentially schedules a driver) to any core. However, if you try to migrate a hardware IRQ that is not compatible with a core, it will migrate back to CPU0 on a Jetson. Intel CPUs on an amd64 desktop PC have what is called an I/O APIC, which is an Advanced Programmable Interrupt Controller, and it is via this that hardware IRQs can be routed to any core. Jetsons do not have this. AMD has its own methods, although I don’t think it is called an I/O APIC.
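A sketch of inspecting and changing affinity through /proc (the IRQ number 123 below is hypothetical; pick a real one from /proc/interrupts, and remember that on a Jetson an incompatible hardware IRQ will simply migrate back to CPU0):

```shell
# List the current affinity bitmask of the first few IRQs (readable without root).
for d in /proc/irq/[0-9]*; do
  printf 'IRQ %-4s mask %s\n' "${d##*/}" "$(cat "$d/smp_affinity" 2>/dev/null)"
done | head -5

# To migrate a (hypothetical) IRQ 123 to CPU1, write bitmask 2 as root:
#   echo 2 | sudo tee /proc/irq/123/smp_affinity
```

The mask is a hex bitmap of allowed cores: 1 means CPU0, 2 means CPU1, 3 means either of them, and so on.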

This means that for a hardware driver to be scheduled, CPU0 cannot be servicing an atomic section of another driver. If it is not in an atomic section, then one driver can have its state stored and the context of the other driver installed by the scheduler. There are a lot of atomic sections in drivers though, and sometimes drivers are badly written. All it takes is too many atomic sections, or too many higher-priority drivers, before some hardware driver is denied access.

A good driver will spend minimal time in atomic sections, and then trigger a software IRQ for those parts which do not need direct hardware access. For example, part of an ethernet driver might have to run atomically against the hardware, but then a checksum is computed; perhaps the checksum could be handled in a software IRQ instead of the hardware IRQ, shortening the ethernet driver’s time on CPU0. A design which computes the checksum inside the hardware IRQ service routine would lock up CPU0 longer than needed. When there are too many IRQ requests and hardware starts suffering, it tends to be called “IRQ starvation”.
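You can watch how much of that deferred network work each core is doing via the NET_RX/NET_TX software IRQ counters:

```shell
# Per-core counts of the network software IRQs; counts that grow rapidly on
# one core during a transfer show where the deferred network work runs.
grep -E 'NET_RX|NET_TX' /proc/softirqs
```

Run it twice a few seconds apart during the problem transfer and compare the deltas per core.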

It only takes one bad driver to slow down all of the hardware. I don’t really have a way to tell you which driver is doing this. You could watch something like htop and see if some user space program shows a high load; if that program is associated with a hardware resource (such as disk access), then “perhaps” this indirectly indicates which driver is involved. Unfortunately, “/proc/interrupts” does not contain performance counters, which would tell you the amount of time a given hardware IRQ takes.

If you have a suspect program, and if you can compile it with profiling enabled, then you could profile that one program and find out whether the calls taking the most time involve hardware access; e.g., if 99% of the time is spent waiting on disk I/O, you could consider it likely that this is the hardware IRQ starving the others. But that is still guessing. Even if you did know the particular driver causing the problem, it might be that the driver is only the indicator, and that whatever is calling it is the real problem. I don’t know of a good solution other than temporarily eliminating some suspect programs and seeing if responsiveness improves.

I will say though that if too much RAM is used, and if swapping is increasing, then this is often the problem without it having anything to do with drivers. You could for example monitor htop and see if there is a lot of swap going on (to a disk; compressed ramdisks tend to not be nearly as slow as an actual disk), or if all of the RAM is used up.
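A quick way to check both conditions from the shell:

```shell
# Free RAM and swap usage straight from the kernel.
grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree' /proc/meminfo
# Live view (if vmstat is installed): nonzero si/so columns mean pages are
# actively swapping in/out, which would explain sluggish scheduling.
command -v vmstat >/dev/null && vmstat 1 3 || true
```

If SwapFree is well below SwapTotal and MemAvailable is near zero during the transfer, memory pressure rather than a driver is the likely culprit.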


Thank you very much for your directional guidance. Next, I think I can try to track the exception at the network card driver layer. I can’t connect to the Jetson TX2 this week; once I confirm what caused this problem, or if I need further help, I will reply to you. Thanks again.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.