Constant load average of 1.0 caused by [nvgpu_channel_p]

For a custom rootfs built with debootstrap (Ubuntu 16.04, xenial) with the NVIDIA binaries applied (./apply_binaries.sh), there is a constant load average of about 1.0 without any active process running.
After looking around for a while I checked whether some I/O was happening and found the process [nvgpu_channel_p], which is indeed in D state.

For the clean Tegra_Linux_Sample-Root-Filesystem_R28.2.0_aarch64.tbz2 the load average is 0.0, as expected.
The strange thing is that if I also apply the NVIDIA binaries to Tegra_Linux_Sample-Root-Filesystem_R28.2, the same disk and CPU usage by [nvgpu_channel_p] shows up, with a load average around 1.0.

Is there some misconfiguration in the NVIDIA binaries that doesn’t exist in the clean Tegra_Linux_Sample-Root-Filesystem_R28.2.0?

Also, to my understanding Tegra_Linux_Sample-Root-Filesystem_R28.2.0 already has the binaries applied, but from a file-tree diff I can see multiple new files appearing after applying the binaries to Tegra_Linux_Sample-Root-Filesystem_R28.2.0.
Why is that?

Processes in D state, filtered:

nvidia@tegra-ubuntu:~$ ps -aux | awk {'if ($8 ==  "D") print $0'}
root       988  0.2  0.0      0     0 ?        D    09:48   0:01 [nvgpu_channel_p]
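For anyone reproducing this, here is a slightly less fragile variant of that filter as a sketch; `-o` pins the column order so awk doesn’t depend on the `ps -aux` field positions:

```shell
# List tasks stuck in uninterruptible sleep (state "D").
# "state" prints the one-character state code in column 1.
ps -eo state,pid,comm | awk '$1 ~ /^D/ { print $2, $3 }'
```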

iostat:

nvidia@tegra-ubuntu:~$ iostat
Linux 4.4.38-tegra (tegra-ubuntu)       04/22/2018      _aarch64_       (6 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.64    0.57    1.09    0.27    0.00   96.43

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
mmcblk0          23.21      1136.24       233.32     674574     138520
mmcblk0rpmb       0.01         0.03         0.00         20          0
mmcblk0boot1      0.06         0.22         0.00        132          0
mmcblk0boot0      0.05         0.22         0.00        128          0

Debugging further, it looks like changing “/sys/devices/17000000.gp10b/railgate_enable” triggers this phantom disk read.
After boot, “/sys/devices/17000000.gp10b/railgate_enable” has a default value of “1”.
Setting it to “0” immediately makes [nvgpu_channel_p] start doing something with the disk (reading? but what?).
Even changing “/sys/devices/17000000.gp10b/railgate_enable” back to “1” does not make the [nvgpu_channel_p] disk usage go away.
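The toggle sequence as a guarded sketch (run as root; the gp10b node path is taken from this thread and only exists on a TX2, hence the guard):

```shell
# Toggle railgating on the gp10b (TX2 integrated GPU) and then watch
# [nvgpu_channel_p] in ps/iotop; guard because the node is TX2-only.
RAILGATE=/sys/devices/17000000.gp10b/railgate_enable
if [ -w "$RAILGATE" ]; then
    cat "$RAILGATE"        # default after boot is 1
    echo 0 > "$RAILGATE"   # [nvgpu_channel_p] starts its phantom disk I/O
    echo 1 > "$RAILGATE"   # setting it back does NOT stop the I/O
else
    echo "no writable gp10b railgate node on this machine"
fi
```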

Any ideas?
I’m using the http://connecttech.com/product/orbitty-carrier-for-nvidia-jetson-tx2-tx1/ carrier board with the BSP for “PWM Fan Support” and “USB Support” applied. The same happens on the development board.

We have exactly the same issue, using Auvidea J120 carrier board with TX2 and L4T 28.1 kernel.
Let me know if you can figure it out.

And a question for the NVIDIA folks: what exactly does the “nvgpu_channel_p” kernel thread do?

-albertr

As I’m not using a GUI, these steps somehow make the “nvgpu_channel_p” disk usage go away:
1) uninstall lightdm with “sudo apt remove lightdm”
2) comment out the “echo 0 > /sys/devices/17000000.gp10b/railgate_enable” line in /etc/systemd/nv.sh
3) restart
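Step 2 as a sed one-liner, sketched so it dry-runs on a copy in /tmp; the nv.sh path and the exact echo line are from this thread, so check them on your own install before pointing the sed at the real file:

```shell
# Comment out the railgate line (dry run on a copy; if /etc/systemd/nv.sh
# is missing here, a stand-in line is created so the sed can still be tested).
cp /etc/systemd/nv.sh /tmp/nv.sh 2>/dev/null || \
  printf ' echo 0 > /sys/devices/17000000.gp10b/railgate_enable\n' > /tmp/nv.sh
sed -i 's|^\([[:space:]]*echo 0 > /sys/devices/17000000.gp10b/railgate_enable\)|#\1|' /tmp/nv.sh
grep railgate_enable /tmp/nv.sh   # the line should now start with "#"
```

Once it looks right, run the same sed as root against /etc/systemd/nv.sh and reboot.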

Of course, I’m more interested in fixing the root of the problem, not just the symptoms.

Another 2 cents: for building a custom rootfs, “apply_binaries.sh” just isn’t the right tool, as it installs so many unneeded files and services. One way is to manually edit the “apply_binaries.sh” script, but maybe there are better ways?
Some kind of documentation, perhaps?
“NVIDIA_Tegra_Linux_Documentation” is not helping much.

As I’m also getting this when flashing the sample filesystem from the JetPack GUI to a TX2 dev board, I’m wondering why only the two of us have hit this problem.
I have no problem with another dev board where the stock R27.1 filesystem exists.
Maybe some pieces are missing in the R28.2 “apply_binaries.sh”?

The only thing I can think of is that apply_binaries.sh needs to run as root; if you didn’t use root, then some files either would not be installed or would become inaccessible even if present (“sudo” itself is an example of a file which works only if root permissions were used to install it in the first place…in other cases you might end up with a previous version of a file instead of the apply_binaries.sh version). The R27.0.1 which the TX2 originally shipped with was problematic anyway…I think R27.1 was much better, but still not entirely reliable. Once you get to R28.1 things are very stable. If odd issues occur right away after a flash on R28.1 or R28.2, the cause is typically somewhere in the install procedure.

You might be interested in installing package “iotop” for monitoring resources (try “iotop -o”). On some other distributions I like “ftop”, but I haven’t found ftop for Ubuntu. There is also “atop”.

That works, thanks! As for why only the two of us are seeing this issue, I think there aren’t that many people around using non-DevKit boards, and I’m pretty sure we both customized our filesystems; for instance, I disabled a bunch of NVIDIA-supplied services and disabled many nodes in the device tree. Maybe some of those changes triggered the issue. I’d also like to know what it is exactly so I can fix it for good.

-albertr

Do either “iotop -o” or “htop” show something using CPU or other resources matching that load?

“The only thing I can think of is that apply_binaries.sh would need to run as root, but if you didn’t use root, then some files either would not be installed or would perhaps become inaccessible”
I’m using sudo all the way through building the flash image. The problem happens even when flashing with the JetPack GUI, where you only need to click some “check” and “next” buttons without typing sudo or becoming root manually.

“I think there’re not so many people around using non-DevKit boards, and I’m pretty sure we both did customize our filesystems”
For me the problem also happens with the unmodified “Tegra_Linux_Sample-Root-Filesystem_R28.2.0_aarch64.tbz2” rootfs and a DevKit board. I swapped the DevKit TX2 installed on the board with another, non-flashed TX2 to test whether the problem also happens on the DevKit board. I assume a DevKit TX2 is identical to any other TX2, so the only culprit seems to be R28.2.

“Do either “iotop -o” or “htop” show something using CPU or other resources matching that load?”
I mentioned in the first post that there is no CPU load at all.
iotop and htop show nothing interesting.

I’m running L4T 28.1, so the issue seems to affect both 28.1 and 28.2.

-albertr

Is any other support available from NVIDIA besides this forum?

In case the issue is related to a specific revision of the TX2 module, mine is attached.

-albertr

On my TX2 the first line is unique, but on the second line everything is the same except that I have B02 instead of B01.
I guess those are revisions or something like that.

I don’t see a problem on my R28.2…not sure which revision I have; it is from the very first batch ever shipped.

One thing I wonder about is if one of the load average tools might be including “idle” as part of the load (which would be a bug). Example:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.64    0.57    1.09    0.27    0.00   96.43

“I don’t see a problem on my R28.2…not sure which revision I have, it is from the very first batch ever shipped.”
Is NVIDIA silently releasing updates to R28.2 without changing the minor release number?
I noticed this because I got a different size and md5 hash for the Tegra_Linux_Sample-Root-Filesystem_R28.2.0_aarch64 and Tegra186_Linux_R28.2.0_aarch64.tbz2 files downloaded yesterday compared with a week ago.

The build date for my problematic OS install looks really old…

root@tegra-ubuntu:/home/nvidia# cat /etc/nv_tegra_release
# R28 (release), REVISION: 2.0, GCID: 10136452, BOARD: t186ref, EABI: aarch64, DATE: Fri Dec  1 14:20:33 UTC 2017

“One thing I wonder about is if one of the load average tools might be including “idle” as part of the load (which would be a bug).”

top - 06:04:28 up 3 min,  2 users,  load average: 1.10, 0.66, 0.28
Tasks: 315 total,   1 running, 314 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.2 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8041464 total,  6659368 free,   706896 used,   675200 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7237768 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2204 nvidia    20   0    9008   3316   2576 R   1.6  0.0   0:02.42 top
7 root      20   0       0      0      0 S   0.3  0.0   0:00.23 rcu_preempt
106 root      20   0       0      0      0 S   0.3  0.0   0:00.13 kworker/3:1
894 root      20   0       0      0      0 D   0.3  0.0   0:00.42 nvgpu_chan+

My actual hardware is from an early batch…the R28.2 release should be the same for everyone, though I couldn’t swear to it. Do note that if a compressed package uses a different compression level the md5sum will change (any bz2 or tbz2 file qualifies)…it’s the packages inside that matter. For that case, take a look at the drivers which are provided by NVIDIA instead of by Ubuntu…their sha1sums are in “/etc/nv_tegra_release”, which in turn is provided in the driver package “nv_tegra/nvidia_drivers.tbz2”.
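A quick way to run that check on the Jetson itself, sketched under the assumption that everything after the “# R28 …” header in /etc/nv_tegra_release is plain sha1sum-style “hash *path” pairs:

```shell
# Verify the NVIDIA-provided driver files against the sums recorded in
# /etc/nv_tegra_release; the "#" header line would confuse sha1sum -c,
# so strip comment lines first.
grep -v '^#' /etc/nv_tegra_release | sha1sum -c -
```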

I was looking at this and realized that the act of measuring the load average itself gives a short burst of load. Regardless of whether you use htop, top, or uptime, the load average is listed for three intervals: the last 1 min, the last 5 min, and the last 15 min. In your example:

load average: 1.10, 0.66, 0.28

…so over the last 15 minutes the load was on average 0.28. The load goes up as you approach the current time. My desktop PC is slightly lower, but still nearly 1 if looking at the last minute, and it goes down over the longer-term averages. My desktop, though, does hardware I/O on all cores and doesn’t throttle the way a Jetson does, so the PC’s load does not drop over time as much as my Jetson’s does (plus I’m doing things like ssh and firefox on the PC).

“…so over the last 15 minutes the load was on average 0.28”
That was the first two minutes after a restart. Over any period longer than 15 min, all three stats sit around 1.0.

I verified that the 5- and 15-minute loads gradually approach 1 while the instantaneous load is 1 (I recorded the load for about half an hour and watched it).

Pseudo-related: I find the program “ttyload” very useful for this. Also “glances”. All of these programs ultimately read their data from “/proc/loadavg”. Using different programs with lower overhead was just to prove to myself that it wasn’t the tool itself causing the load. “xosview” is nice for a graphical view and has a kind of moving-average graph (you have to enlarge the window to appreciate it). “nmon” lets you pick and choose which resources to monitor.
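Since everything ultimately comes from “/proc/loadavg”, you can also just watch the raw file:

```shell
# /proc/loadavg fields: 1-, 5-, and 15-minute averages, then
# runnable/total task counts, then the most recently created PID.
cat /proc/loadavg
# to watch it update: watch -n 1 cat /proc/loadavg
```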

I have not found any process which would account for this. It is possible that the loadavg file is mistakenly including the “idle” process (when nothing is using the system, a process still has to run…perhaps loadavg forgot to rule this process out, and it approaches 100% use as all other processes go away).

I’m not sure which kernel code determines the content of “/proc/loadavg”, but this is probably a bug…either a bug in a process not being visible, or a bug in loadavg counting idle as if it were a real process participating in the load.

EDIT: I did see something which seems to be a possible weak point in load average. See this…perhaps there are uninterruptible tasks which are blocked but are being counted as load:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

I don’t think it’s a load-average calculation bug, as it’s clearly visible that the nvgpu_channel_p thread is doing something with the disk (no idea what).