Why are the ksoftirqd and kworker threads so resource-hungry?

I’m developing an H.264 streaming server.

I have some issues with the MMAPI encoder, and one with the RTP packet sender.
Using the System Profiler, I found some threads that consume a lot of CPU.

When I ran a traffic-generator thread in place of the H.264 encoder and RTP packet sender threads,
the kernel threads “ksoftirqd/” and “kworker/” occupied a lot of CPU time.

I understand that these threads exist to save CPU time, but here they seem to have a bad influence.
Why are they running?

In the profiler, my thread calling sendto() stays in the “ready” state.

(Screenshots: ksoftirqd/kworker CPU usage, and two views of the RtpSender thread.)

Here is a URL with interesting replies on this topic:
http://askubuntu.com/questions/7858/why-is-ksoftirqd-0-process-using-all-of-my-cpu

Note that software IRQs can run on any core, but hardware IRQs (which service hardware devices wired to the CPU) can only run on core 0 (this is not true of a desktop system). An interrupt typically loads and runs a driver. For a hardware IRQ there may be a need to do things only possible on core 0 (such as I/O over wires; the wires do not physically touch the other cores), but there may also be data processing after the hardware-dependent part which could run on any core. A good hardware driver does the minimum work possible in the part only CPU0 can do, then sends the rest of the work off to another core via a software IRQ, which means the next hardware IRQ can start sooner. When one driver starves another by hogging whatever core is available, that is “IRQ starvation”: the interrupts are arriving faster than the drivers can service them. You may see ksoftirqd in connection with the software half split off from a hardware driver; the hardware half would always be on CPU0.

Optimizations could revolve around figuring out whether you have a lot of overhead from many IRQs that each have little work to do, and instead arranging for fewer IRQs that each do more. Or the inverse: perhaps the driver has too much to do and won’t let go of CPU0, and you might be better off running more interrupts that each do less work (a bit more overhead in exchange for a responsive system).
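The tradeoff can be made concrete with a deliberately simple model (an illustrative assumption, not something measured in this thread): a fixed amount of interrupt work W (seconds of CPU0 time per second) is split across n interrupts per second, each of which adds a fixed overhead o. The last line plugs in a hypothetical o = 5 µs at 10,000 interrupts per second.

```latex
\begin{aligned}
T_{\mathrm{CPU0}}(n) &= n\,o + W && \text{(CPU0 time spent per second)}\\
t_{\mathrm{hold}}(n) &= o + \frac{W}{n} && \text{(time CPU0 is held per interrupt)}\\
n\,o &= 10^{4}\,\mathrm{s^{-1}} \times 5\,\mu\mathrm{s} = 0.05 && \text{(5\% of CPU0, overhead alone)}
\end{aligned}
```

Lowering n shrinks the n·o overhead term but lengthens each stretch during which CPU0 is held; raising n does the opposite.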

Dear linuxdev,

Thank you for the information.

I have already researched the purpose of ksoftirqd/kworker.
I don’t think my application software should generate very many hardware interrupts.

Hardware access:

  • Ethernet send: fewer than 10,000/s (socket sendto() calls)
  • Ethernet receive: fewer than 100/s (socket recv() calls)
  • Video encoder: none (dummy data generation instead)
  • Video decoder: none
  • Clock/timer: many (usleep(), gettimeofday(), etc.)

So I don’t understand why ksoftirqd and kworker use so much CPU.

In the thread you linked, the questioner said:
I know it’s askubuntu.com, but on Raspberry Pi is quite the opposite, ksoftirqd is eating all of the CPU on intensive IRQ load.

Could there be remaining issues with ksoftirqd/kworker in Ubuntu for ARM systems?

To know what is going on, I think you’d have to do some profiling directly inside the kernel (and I’m not sure how you would go about that) to see which IRQs are bottlenecking the others. Otherwise it is just speculation, since it could be a single misbehaving software driver, or a flood of legitimate interrupts on a busy system. The whole ksoftirq mechanism has its own scheduler, which I think is a good thing; it also means you have a chance to raise the priority of (renice) some threads if the problem persists, and get better performance for whichever driver you are most worried about. I can’t really see your profiling, but does what you see give you information about specific drivers you can identify?

Hi mynaemi, could you quickly check whether H.264 encoding via GStreamer also shows this?

Hi DaneLLL,

I can run gst-launch, but I can’t easily write code using the GStreamer API.
I want to change the bit rate, the resolution, and (optionally) the frame rate,
under the control of the client equipment (e.g. via RTSP),
so I chose the OpenMAX IL API and the NVIDIA Multimedia API.

Hi mynaemi,
Do you also observe high CPU usage in tegrastats? Please share the tegrastats output while running your application, for reference.

Hi mynaemi,
Running ~/tegra_multimedia_api/samples/10_camera_recording (manually set to 1920x1080, with file writing skipped) on r24.2.1, we don’t observe the high CPU usage. It does not look like the H.264 encoder is what consumes the CPU. Could you break down your app to get more clues?

Hi mynaemi,

Have you made any progress on this issue?
Do you have any experiment results you can share to move it forward?

Thanks

Hi DaneLLL and kayccc,

I appreciate your support very much.

Sorry, I’ve been held up by the pressure of writing a year-end report.

The differences between DaneLLL’s experiment and my application are:
(1) Network access: many sendto() calls
(2) Timer access: many usleep(), gettimeofday(), etc. calls
But I don’t think these are excessive.

I wonder whether these hardware accesses cause this phenomenon.

All hardware IRQ handling starts on the CPU0 core, and the network card no doubt uses time on that core. Flooding CPU0 with network interrupts could change things for other drivers, but not necessarily; it’s plausible, but you’d need evidence that the IRQs are overwhelming CPU0.

usleep() would likely improve things, since it gives the ksoftirq scheduler a chance to smooth out its load and context-switch to handle the higher-priority soft IRQs, and to context-switch among the non-CPU0 cores. (I’m assuming the usleep() is not within a hardware driver that is locking CPU0; a locked CPU0 core sleeping would be extraordinarily bad.) This would make even a heavily loaded system feel like it is running smoothly.

The effect of other calls like gettimeofday() would depend on how the drivers are structured. Perhaps some of gettimeofday()’s work is offloaded to ksoftirqd, which would be good; but if gettimeofday() were indirectly related to NTP and locked CPU0 while waiting for a remote answer, that would be bad (I do not believe this is an actual issue; it is just for illustration).

Please help us reproduce the issue with the MMAPI samples (with your patches) so that we can investigate further. We do not observe the issue when running 10_camera_recording.

As said before, ksoftirqd does on Linux what the DPC (Deferred Procedure Call) mechanism does on Windows. It’s a necessary part of writing drivers that work well in a heterogeneous system.
It seems likely to me that the high IRQ usage comes from the network access.
If you write and run a program that makes 10,000 sendto() calls per second on some socket and does nothing else, do you see the same IRQ level?
If so, then it’s pretty clear that the culprit is some part of the network stack.
Also, if you test on wireless versus Ethernet, is there a difference?