Constant load average 1.0 caused by [nvgpu_channel_p]

miks1aeb0,

Sorry that I only just saw this thread and haven’t gone through every comment. Could you describe what “load” you are referring to here?

It’s a constant load average of 1.0, with an [nvgpu_channel_p] thread in D state in the process list.
By trial and error I found out that changing /sys/devices/17000000.gp10b/railgate_enable to 1 triggers this behaviour.
It happens with the latest Tegra_Linux_Sample-Root-Filesystem_R28.2.0_aarch64.tbz2 and Tegra186_Linux_R28.2.0_aarch64.tbz2 on a TX2 dev board.
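For reference, the 1-minute figure can be pulled out of `uptime` output with a one-liner. A minimal sketch, using a sample string that mimics the output on the affected board (on a live system you would pipe `uptime` itself):

```shell
#!/bin/sh
# Sample "uptime" output mimicking the affected TX2; on a live system,
# replace the sample with the real command: uptime
uptime_sample='06:10:01 up 10 min,  1 user,  load average: 1.00, 0.98, 0.74'

# Keep only the first number after "load average:" (the 1-minute average).
load1=$(printf '%s\n' "$uptime_sample" | sed 's/.*load average: //; s/,.*//')
echo "1-min load: $load1"
```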

miks1aeb0,

I remember there was some issue when railgate is enabled, so we disable it by default. A normal use case should not need it. Did you enable it at some point?

May I ask why you need zero load here?

No, I’m not enabling it. It seems to be enabled by default. When the disabling is triggered by nv.sh, the problems with [nvgpu_channel_p] start.

miks1aeb0,

Your statement is not clear after comparing #24 and #22. What exactly triggers the load: enabling railgate or disabling it? The init script should disable railgate.

The load is triggered by the default installation via /etc/systemd/nv.sh.
When “echo 0 > /sys/devices/17000000.gp10b/railgate_enable” is removed from nv.sh, there is no [nvgpu_channel_p] thread at all and the load average is 0.0, as expected.

Running “echo 0 > /sys/devices/17000000.gp10b/railgate_enable” immediately makes [nvgpu_channel_p] appear in D state, and the load average climbs to 1.0.
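For what it’s worth, a task in uninterruptible sleep consumes no CPU, but on Linux it still counts toward the load average, which is why a single stuck [nvgpu_channel_p] shows up as a constant 1.0. A minimal sketch of counting D-state tasks from `ps aux`-style output (the sample lines mimic the listing quoted later in this thread):

```shell
#!/bin/sh
# Count tasks in uninterruptible sleep (STAT column starting with "D")
# from "ps aux"-style output. Each such task adds 1.0 to the Linux load
# average even though it uses no CPU time.
count_d_state() {
  awk '$8 ~ /^D/ { n++ } END { print n + 0 }'
}

ps_sample='root    1  0.0  0.1  1  1  ?  Ss  06:00  0:01  /sbin/init
root  910  0.2  0.0  0  0  ?  D   06:01  0:00  [nvgpu_channel_p]'

d_count=$(printf '%s\n' "$ps_sample" | count_d_state)
echo "D-state tasks: $d_count"
```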

miks1aeb0,

Thanks for the clarification.

So:
railgate is enabled → no load.
railgate is disabled (default) → load.

May I ask whether this load affects your use case? (Apologies in advance if you already mentioned it in a previous comment.)

“railgate is disabled (default) → load.”
In fact the default state is “enabled”; it is disabled by the nv.sh script.
An interesting detail is that if I disable it and afterwards re-enable it, the [nvgpu_channel_p] process stays.
It looks like a software bug, as this scenario does not occur with r27.1.

“May I ask whether this load affects your use case?”
At the moment, no, but I’m not comfortable using the TX2 with r28.2 in a production environment with this kind of behaviour (disk read or maybe write).

Disk sleep state will not show up in user-space tools, which explains why the individual process does not show. If you run the following, I see PID “1275” on my system:

ps aux | egrep nvgpu_channel_p | egrep -v 'grep'

You can then go to “/proc/1275/” (adjust for your case) and “less status” to browse… you’ll see the state is disk sleep. What I find odd is that it hangs around when railgate_enable is 0. Perhaps railgate was disabled at a bad moment while the thread was uninterruptible, and the end-of-PID cleanup that would have occurred had it terminated never happened. In that case it wouldn’t consume power or CPU cycles, and counting it toward the load average would not really be valid… but if it were to hold on to some part of the disk without the ability to let go, it might be an issue (a limited resource leak).
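The check above can be scripted rather than browsed with `less`. A sketch, where the sample text stands in for a real (abbreviated) `/proc/1275/status` on the affected board:

```shell
#!/bin/sh
# Extract the State field from a /proc/<pid>/status dump. The sample
# stands in for "cat /proc/1275/status" on the affected TX2; adjust
# the PID for your case on a live system.
sample_status='Name: nvgpu_channel_p
State: D (disk sleep)
Pid: 1275'

# Split each line on ": " and print the value of the State field.
state=$(printf '%s\n' "$sample_status" | awk -F': ' '/^State:/ { print $2 }')
echo "State: $state"
```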

Is there a way to know whether the railgate disable in “nv.sh” occurred in the middle of a system call, making that system call hang forever?

I think we misunderstood each other. The default status of railgate “from the kernel” is enabled. However, that causes an issue on the TX2 module, so we disable it in the script. That is what I called the “default”.

I will check with the internal team to see if we can improve it in some way.

miks1aeb0,
I see the uninterruptible sleep isn’t gone even with railgate_enable set to 1.
Did I miss any steps?

root@tegra-ubuntu:~# cat /sys/devices/17000000.gp10b/railgate_enable
1
root@tegra-ubuntu:~# ps aux | egrep nvgpu_channel_p | egrep -v 'grep'
root       910  0.2  0.0      0     0 ?        D    06:01   0:00 [nvgpu_channel_p]

“Is there a way to know whether the railgate disable in “nv.sh” occurred in the middle of a system call, making that system call hang forever?”
It’s called at system init. I deleted it and ran it manually after the system had been running for about 2 minutes; the behaviour is the same.

“I think we misunderstood each other. The default status of railgate “from the kernel” is enabled. However, that causes an issue on the TX2 module, so we disable it in the script. That is what I called the “default”.”
What’s the issue when railgate is enabled on the TX2?

“I will check with the internal team to see if we can improve it in some way.”
So will you post the results here, please?

Did you enable it by hand, or is this the default state from the kernel?

This thread may describe one of the errors you might hit once you change the setting in nv.sh.

I enabled it by commenting out the “echo 0 > /sys/devices/17000000.gp10b/railgate_enable” line in nv.sh.

That’s very strange then, as I get no uninterruptible sleep when the railgate line is deleted from nv.sh (and the system is restarted so it boots without the railgate setting being changed).

@WayneWWW, so you just recommend disabling railgate with nv.sh and ignoring that [nvgpu_channel_p] behaviour for now?
I certainly don’t want to run into the issue you linked.

miks1aeb0,

I’ll check if there is any workaround.

And so do I.

-albertr

Just checked…
We have a worker in nvgpu that runs in the background and examines the channel state periodically (killing channels if there is a timeout), and we do have an uninterruptible wait in the GPU driver.

As I indicated in a previous comment, our team would like to know whether this causes problems for your use case.