Constant load average 1.0 caused by [nvgpu_channel_p]

miks1aeb0,

Sorry that I only just saw this thread and haven’t gone through every comment. Could you describe what “load” you are referring to here?

It’s a constant load average of 1.0, with an [nvgpu_channel_p] thread in D state in the process list.
By trial and error I found out that changing /sys/devices/17000000.gp10b/railgate_enable to 1 triggers this behaviour.
It happens with the latest Tegra_Linux_Sample-Root-Filesystem_R28.2.0_aarch64.tbz2 and Tegra186_Linux_R28.2.0_aarch64.tbz2 on a TX2 dev board.
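For reference, the 1-minute figure can be pulled out of `uptime` output with a one-liner. A minimal sketch, using a sample string that mimics the output on the affected board (on a live system you would pipe `uptime` itself):

```shell
#!/bin/sh
# Sample "uptime" output mimicking the affected TX2; on a live system,
# replace the sample with the real command: uptime
uptime_sample='06:10:01 up 10 min,  1 user,  load average: 1.00, 0.98, 0.74'

# Keep only the first number after "load average:" (the 1-minute average).
load1=$(printf '%s\n' "$uptime_sample" | sed 's/.*load average: //; s/,.*//')
echo "1-min load: $load1"
```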

miks1aeb0,

I remember there was some issue when railgate is enabled, so we disable it by default. A normal use case should not need it. Did you enable it at some point?

May I ask why you need zero load here?

No, I’m not enabling it. It seems to be enabled by default. When the disabling is triggered by nv.sh, the problems with [nvgpu_channel_p] start.

miks1aeb0,

Your statement is not clear after comparing #24 and #22. What exactly triggers the load: enabling railgate or disabling it? The init script should disable railgate.

The load is triggered by the default installation via /etc/systemd/nv.sh.
When “echo 0 > /sys/devices/17000000.gp10b/railgate_enable” is removed from nv.sh, there is no [nvgpu_channel_p] thread at all and the load average is 0.0, as expected.

Running “echo 0 > /sys/devices/17000000.gp10b/railgate_enable” immediately makes [nvgpu_channel_p] appear in D state, and the load average climbs to 1.0.
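For what it’s worth, a task in uninterruptible sleep consumes no CPU, but on Linux it still counts toward the load average, which is why a single stuck [nvgpu_channel_p] shows up as a constant 1.0. A minimal sketch of counting D-state tasks from `ps aux`-style output (the sample lines mimic the listing quoted later in this thread):

```shell
#!/bin/sh
# Count tasks in uninterruptible sleep (STAT column starting with "D")
# from "ps aux"-style output. Each such task adds 1.0 to the Linux load
# average even though it uses no CPU time.
count_d_state() {
  awk '$8 ~ /^D/ { n++ } END { print n + 0 }'
}

ps_sample='root    1  0.0  0.1  1  1  ?  Ss  06:00  0:01  /sbin/init
root  910  0.2  0.0  0  0  ?  D   06:01  0:00  [nvgpu_channel_p]'

d_count=$(printf '%s\n' "$ps_sample" | count_d_state)
echo "D-state tasks: $d_count"
```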

miks1aeb0,

Thanks for the clarification.

So:
railgate is enabled → no load.
railgate is disabled (default) → load.

May I ask whether this load affects your use case? (Apologies in advance if you already mentioned it in a previous comment.)

“railgate is disabled (default) → load.”
In fact the default state is “enabled”; it is disabled by the nv.sh script.
An interesting detail is that if I disable it and afterwards re-enable it, the [nvgpu_channel_p] process stays.
It looks like a software bug, as this scenario does not occur with r27.1.

“May I ask whether this load affects your use case?”
At the moment, no, but I’m not comfortable using the TX2 with r28.2 in a production environment with this kind of behaviour (disk read or maybe write).

Disk sleep state will not show up in user-space tools, which explains why the individual process does not show. If you run the following, I see PID “1275” on my system:

ps aux | egrep nvgpu_channel_p | egrep -v 'grep'

You can then go to “/proc/1275/” (adjust for your case) and “less status” to browse… you’ll see the state is disk sleep. What I find odd is that it hangs around when railgate_enable is 0. Perhaps railgate was disabled at a bad moment while the thread was uninterruptible, and the end-of-PID cleanup that would have occurred had it terminated never happened. In that case it wouldn’t consume power or CPU cycles, and counting it toward the load average would not really be valid… but if it were to hold on to some part of the disk without the ability to let go, it might be an issue (a limited resource leak).
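The check above can be scripted rather than browsed with `less`. A sketch, where the sample text stands in for a real (abbreviated) `/proc/1275/status` on the affected board:

```shell
#!/bin/sh
# Extract the State field from a /proc/<pid>/status dump. The sample
# stands in for "cat /proc/1275/status" on the affected TX2; adjust
# the PID for your case on a live system.
sample_status='Name: nvgpu_channel_p
State: D (disk sleep)
Pid: 1275'

# Split each line on ": " and print the value of the State field.
state=$(printf '%s\n' "$sample_status" | awk -F': ' '/^State:/ { print $2 }')
echo "State: $state"
```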

Is there a way to know whether the railgate disable in “nv.sh” occurred in the middle of a system call, making that system call hang forever?

I think we misunderstood each other. The default status of railgate “from the kernel” is enabled. However, that causes an issue on the TX2 module, so we disable it in the script. That is what I called the “default”.

I will check with the internal team to see if we can improve it in some way.

miks1aeb0,
I see the uninterruptible sleep isn’t gone even with railgate_enable set to 1.
Did I miss any steps?

root@tegra-ubuntu:~# cat /sys/devices/17000000.gp10b/railgate_enable
1
root@tegra-ubuntu:~# ps aux | egrep nvgpu_channel_p | egrep -v 'grep'
root       910  0.2  0.0      0     0 ?        D    06:01   0:00 [nvgpu_channel_p]

“Is there a way to know whether the railgate disable in “nv.sh” occurred in the middle of a system call, making that system call hang forever?”
It’s called at system init. I deleted it and ran it manually after the system had been running for about 2 minutes; the behaviour is the same.

“I think we misunderstood each other. The default status of railgate “from the kernel” is enabled. However, that causes an issue on the TX2 module, so we disable it in the script. That is what I called the “default”.”
What’s the issue when railgate is enabled on the TX2?

“I will check with the internal team to see if we can improve it in some way.”
So will you post the results here, please?

Did you enable it by hand, or is this the default state from the kernel?

This thread may describe one of the errors you might hit once you change the setting in nv.sh.

I enabled it by commenting out the “echo 0 > /sys/devices/17000000.gp10b/railgate_enable” line in nv.sh.

That’s very strange then, as I get no uninterruptible sleep when the railgate line is deleted from nv.sh (and the system is restarted so it boots without the railgate setting being changed).

@WayneWWW, so you just recommend disabling railgate with nv.sh and ignoring that [nvgpu_channel_p] behaviour for now?
I certainly don’t want to run into the issue you linked.

miks1aeb0,

I’ll check if there is any workaround.

And so do I.

-albertr

Just checked…
We have a worker in nvgpu that runs in the background and examines the channel state periodically (killing channels if there is a timeout), and we do have an uninterruptible wait in the GPU driver.

As I indicated in a previous comment, our team would like to know whether this causes problems for your use case.