pmu_enable_hw: Falcon mem scrubbing timeout

Hi. I’m having an issue on the TX2 with both 28.1 and 28.2. After the system has been running for a while, rendering to the screen freezes and the following can be seen in dmesg:

[77893.667592] gk20a 17000000.gp10b: pmu_enable_hw: Falcon mem scrubbing timeout
[77893.667619] gk20a 17000000.gp10b: pmu_copy_to_dmem: copy failed. bytes written -19924, expected 44
[77893.667651] gk20a 17000000.gp10b: pmu_copy_to_dmem: copy failed. bytes written 4608, expected 76
[77903.669543] gk20a 17000000.gp10b: Timeout detected @ pmu_exec_gen_bl+0x198/0x760 
[77903.669556] gk20a 17000000.gp10b: pmu_wait_for_halt: ACR boot timed out
[77903.669725] gk20a 17000000.gp10b: gk20a_pm_finalize_poweron: failed to init gk20a pmu
[77904.185477] gk20a 17000000.gp10b: gk20a_submit_channel_gpfifo: failed to host gk20a to submit gpfifo, process X
[77904.200574] gk20a 17000000.gp10b: gk20a_submit_channel_gpfifo: failed to host gk20a to submit gpfifo, process X
[77904.200782] gk20a 17000000.gp10b: gk20a_submit_channel_gpfifo: failed to host gk20a to submit gpfifo, process X

More of the gpfifo errors follow.
If the screen was not blanked at the time, I can still see the mouse pointer moving around, but no rendering takes place. If the screen was blanked, it stays black.

If I try to restart X it just says

[ 81718.636] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[ 81718.636] (EE) NVIDIA(0): Failing initialization of X screen 0

Only by rebooting am I able to get on-screen graphics back.

I’m confused by the references to “boot” and “poweron”; as you can see from the timestamps, the system had been up for over 21 hours already. There was no dmesg output for 7 hours prior to the Falcon mem scrubbing message.

This never happened on the TX1 with 24.1.
I notice several mentions of “pmu” in the errors. Is there some power management setting I could tweak in the kernel config to make this go away?

marcus_c,

This error seems rare. Since it is a pmu error, could you describe how long (21 hours?) and how heavily your TX2 was running?

Is there any fast way to reproduce this issue? Do you run any GPU workloads?

Hi.

I normally keep the TX2 turned on 24/7, but only subject it to light interactive loads (source code editing, web browsing, etc). I run Gentoo, so I will occasionally give it a lot of software to compile at once (with -j5), but this does not seem to increase the risk of lockup. Since Chromium fails to init direct rendering (for reasons I haven’t investigated yet) the only hardware rendering that happens is the 2D acceleration in X. I did experiment with running a Monero miner using CUDA for a few weeks, but that did not increase the amount of lockups either.

I usually discover the issue when I switch the KVM to the TX2 and try to unblank the screen. Only on one or two occasions has it happened while actively using the TX2. It can take days or even weeks before it happens; 21 hours is at the lower end of the spectrum. I have no idea how to make it happen faster, I’m afraid.

I put my custom kernel config here (it is based on one from my old TX1) in case there is something visibly strange with it…

marcus_c,

I suspect the Monero miner (though the possibility is not high) and KVM (have you installed KVM on the TX2?). The other items do not seem to be the cause of the error.

Does the issue occur when using a “pure” JetPack BSP?

I realize that KVM can mean two things here. :-)

  • I do have kernel virtualization (/dev/kvm). I’ve only used it for a few test runs (so far). It’s never been in use at a point when I’ve seen the problem. Also, the problem existed even before I made the necessary Device Tree changes to make KVM actually init.

  • I have the TX2 hooked up to a KVM switch (with an HDMI->DVI adapter). This one I use a lot, switching back and forth to the TX2 multiple times a day.

As for the Monero miner, as I said, I only ran it once, for a few weeks. I have observed the issue both before and after that.

My firmware and X11/GL drivers are from the 28.2 driver package. Everything else is compiled locally, including the kernel (source also from the 28.2 package). Before that I had the same setup, but with the 28.1 drivers/firmware and kernel sources, and the same issue. It was the fact that upgrading to 28.2 did not resolve the issue (even though some of the changes to the video driver code looked like they might be relevant) that finally made me bother to report the problem. :-)

While I could boot up a “pure” environment from the internal MMC, I don’t really have a way of triggering the issue at will, so the only way to check would be to use the system normally for at least a week, which would be somewhat impractical since it has not been integrated into my work environment…

marcus_c,

The pmu error is a rare one. How much did you modify the kernel image? We should narrow down the cause, so I suggest using the original BSP/rootfs if possible.

The only changes to the kernel sources are some minor patches to fix compilation errors, and the DT fix (the issue happens even without it).
Otherwise it’s just the updated kernel config.

As I said, I could use the original kernel and rootfs to run some tests, but it’s not really a solution going forward. If I don’t do anything with the environment then it’s not going to do the stuff I need, so then it doesn’t matter if it locks up or not…

The present situation with the lockups is annoying, but having to throw away everything and start over from scratch would be much worse. I’m mostly looking for ideas how to get closer to a solution without destroying my actual work environment. Since it takes so long for the issue to manifest I need to be able to keep using the system normally during the time.

If I put KVM work on hold for the moment I might be able to use the stock kernel for a while, I’ll need to double-check that. I could also try matching the xorg and mesa versions of the JetPack rootfs. I shouldn’t blindly change too many things at once, though, as then we still wouldn’t know which change was relevant…

If there are any kernel traces you want me to enable/add I’d be happy to do so.

Please share the current dmesg with us as a first step.

Certainly. Here’s my current /var/log/dmesg.

I just checked.

RAM scrubbing is a HW sequence and SW has no role to play there, except triggering it. This shows that the voltage or clocks are not coming up properly.

Do you have any fast way to reproduce the issue? The Falcon mem scrubbing timeout might happen due to a bad reset or wrong voltage.

So if the voltage or clocks “are not coming up properly”, then that means that they must have “gone down” first. Is there a way to prevent this? (I’m on wall power so I don’t particularly need super power saving modes…)

I’m pretty sure that the Monero miner put a lot of stress on the GPU, but as I said it did not seem to increase the occurrence of the issue. How many levels do the voltages and clocks have that they need to transition between, and can they be monitored? It would be interesting to see if it is a specific state transition (which would then happen very rarely) that triggers the issue.

During your 21 hours test, is the device just idle?
How frequent is this issue now?

It was like most other times, I was switching back and forth, sometimes browsing some web pages, sometimes editing a file or two, sometimes compiling something. As per usual, I was switching to the TX2 as it was idling in the background when I discovered that the issue had happened.

Right now the issue has not popped up yet after the 21h run, so we’re at 3 days and counting. This is normal; as I said, it goes days or weeks between instances, with the TX2 running 24/7 (but mostly idle). So I think it would be valuable if I could add some relevant statistics capture in preparation for the next time, if any ideas surface. In any case, it’s never that I start some heavy computation and that triggers the issue.

Hi marcus_c,

Please do the following before the test:

echo 0 > /sys/devices/gpu.0/railgate_enable
cd /sys/devices/gpu.0/devfreq/17000000.gp10b
cat max_freq > min_freq # fix gpu freq to maximum
cat cur_freq # should be the same as max_freq

Then, monitor the values below before and after the error occurs:

sudo -s

cat /sys/kernel/debug/bpmp/debug/clk/gpu/dvfs
cat /sys/kernel/debug/bpmp/debug/clk/clk_tree 
cat /sys/kernel/debug/bpmp/debug/regulator/vdd_gpu/voltage

You can dump the values to a file.

Hi WayneWWW,

Thanks. I’ve started a script which dumps these to a log every minute. I’ll let you know the results after the next time the issue appears.
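In case it’s useful, it’s roughly the following minimal sketch: the debugfs paths are the ones you listed, the log location /var/log/gpu-dvfs.log is just where I chose to put it, and it runs as root.

#!/bin/bash
# Sketch of the logging script: dump the suggested debugfs nodes
# to a log file once a minute, with a timestamp per sample.
LOG=/var/log/gpu-dvfs.log
while true; do
    {
        echo "=== $(date -Is) ==="
        for f in /sys/kernel/debug/bpmp/debug/clk/gpu/dvfs \
                 /sys/kernel/debug/bpmp/debug/clk/clk_tree \
                 /sys/kernel/debug/bpmp/debug/regulator/vdd_gpu/voltage; do
            echo "--- $f ---"
            cat "$f"
        done
    } >> "$LOG" 2>&1
    sleep 60
done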

Hi marcus_c,

Any update?

Hi WayneWWW,

Nope, it’s still trucking along without lockups with an uptime of 9 days 8:38.

I should probably let it run for a few more weeks before declaring that the railgate/min_freq change fixed it completely, though.

But in other news, my work with KVM is coming along nicely; I now have an aarch64_be guest running with full hardware virtualization. :-)

Any update on that error?