pmu_enable_hw: Falcon mem scrubbing timeout

Hi. I’m having an issue on the TX2 with both 28.1 and 28.2. After the system has been running for a while, rendering to the screen freezes and the following can be seen in dmesg:

[77893.667592] gk20a 17000000.gp10b: pmu_enable_hw: Falcon mem scrubbing timeout
[77893.667619] gk20a 17000000.gp10b: pmu_copy_to_dmem: copy failed. bytes written -19924, expected 44
[77893.667651] gk20a 17000000.gp10b: pmu_copy_to_dmem: copy failed. bytes written 4608, expected 76
[77903.669543] gk20a 17000000.gp10b: Timeout detected @ pmu_exec_gen_bl+0x198/0x760 
[77903.669556] gk20a 17000000.gp10b: pmu_wait_for_halt: ACR boot timed out
[77903.669725] gk20a 17000000.gp10b: gk20a_pm_finalize_poweron: failed to init gk20a pmu
[77904.185477] gk20a 17000000.gp10b: gk20a_submit_channel_gpfifo: failed to host gk20a to submit gpfifo, process X
[77904.200574] gk20a 17000000.gp10b: gk20a_submit_channel_gpfifo: failed to host gk20a to submit gpfifo, process X
[77904.200782] gk20a 17000000.gp10b: gk20a_submit_channel_gpfifo: failed to host gk20a to submit gpfifo, process X

More of the gpfifo errors follow.
If the screen was not blanked at the time, I can still see the mouse pointer moving around, but no rendering takes place. If the screen was blanked, it stays black.

If I try to restart X it just says

[ 81718.636] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[ 81718.636] (EE) NVIDIA(0): Failing initialization of X screen 0

Only by rebooting am I able to get on-screen graphics back.

I’m confused by the references to “boot” and “poweron”; as you can see from the timestamps, the system had been up for over 21 hours already. There was no dmesg output for 7 hours prior to the Falcon mem scrubbing message.

This never happened on the TX1 with 24.1.
I notice several mentions of “pmu” in the errors. Is there some power management setting I could tweak in the kernel config to make this go away?

marcus_c,

This error seems rare. Since it is a pmu error, could you describe how long (21 hours?) and how heavily your TX2 was running?

Is there any fast way to reproduce this issue? Do you run any GPU workloads?

Hi.

I normally keep the TX2 turned on 24/7, but only subject it to light interactive loads (source code editing, web browsing, etc). I run Gentoo, so I will occasionally give it a lot of software to compile at once (with -j5), but this does not seem to increase the risk of lockup. Since Chromium fails to init direct rendering (for reasons I haven’t investigated yet) the only hardware rendering that happens is the 2D acceleration in X. I did experiment with running a Monero miner using CUDA for a few weeks, but that did not increase the amount of lockups either.

I usually discover the issue when I switch the KVM to the TX2 and try to unblank the screen. Only on one or two occasions has it happened while actively using the TX2. It can take days or even weeks before it happens; 21 hours is at the lower end of the spectrum. I have no idea how to make it happen faster, I’m afraid.

I put my custom kernel config here (it is based on one from my old TX1) in case there is something visibly strange with it…

marcus_c,

I suspect the Monero miner (though the possibility is not high) and KVM (have you installed KVM on the TX2?). The other items do not seem to be the cause of the error.

Does the issue occur when using a “pure” JetPack BSP?

I realize that KVM can mean two things here. :-)

  • I do have kernel virtualization (/dev/kvm). I’ve only used it for a few test runs (so far). It’s never been in use at a point when I’ve seen the problem. Also, the problem existed even before I made the necessary Device Tree changes to make KVM actually init.

  • I have the TX2 hooked up to a KVM switch (with an HDMI->DVI adapter). This one I use a lot, switching back and forth to the TX2 multiple times a day.

As for the Monero miner, as I said, I only ran it once, for a few weeks. I have observed the issue both before and after that.

My firmware and X11/GL drivers are from the 28.2 driver package. Everything else is compiled locally, including the kernel (source also from the 28.2 package). Before that I had the same setup, but with the 28.1 drivers/firmware and kernel sources, and the same issue. It was the fact that upgrading to 28.2 did not resolve the issue (even though some of the changes to the video driver code looked like they might be relevant) that finally made me bother to report the problem. :-)

While I could boot up a “pure” environment from the internal MMC, I don’t really have a way of triggering the issue at will, so the only way to check would be to use the system normally for at least a week, which would be somewhat impractical since it has not been integrated into my work environment…

marcus_c,

The pmu error is a rare one. How much did you modify the kernel image? We should narrow down the cause, so I suggest using the original BSP/rootfs if possible.

The only changes to the kernel sources are some minor patches to fix compilation errors, and the DT fix (the issue happens even without it).
Otherwise it’s just the updated kernel config.

As I said, I could use the original kernel and rootfs to run some tests, but it’s not really a solution going forward. If I don’t do anything with the environment then it’s not going to do the stuff I need, so then it doesn’t matter if it locks up or not…

The present situation with the lockups is annoying, but having to throw away everything and start over from scratch would be much worse. I’m mostly looking for ideas how to get closer to a solution without destroying my actual work environment. Since it takes so long for the issue to manifest I need to be able to keep using the system normally during the time.

If I put KVM work on hold for the moment I might be able to use the stock kernel for a while, I’ll need to double-check that. I could also try matching the xorg and mesa versions of the JetPack rootfs. I shouldn’t blindly change too many things at once, though, as then we still wouldn’t know which change was relevant…

If there are any kernel traces you want me to enable/add I’d be happy to do so.

Please share the current dmesg with us as a first step.

Certainly. Here’s my current /var/log/dmesg.

I just checked.

RAM scrubbing is a HW sequence and SW has no role to play there, except triggering it. This shows that the voltage or clocks are not coming up properly.

Do you have any fast way to reproduce the issue? The Falcon mem scrubbing timeout might happen due to a bad reset or wrong voltage.

So if the voltage or clocks “are not coming up properly”, then that means that they must have “gone down” first. Is there a way to prevent this? (I’m on wall power so I don’t particularly need super power saving modes…)

I’m pretty sure that the Monero miner put a lot of stress on the GPU, but as I said it did not seem to increase the occurrence of the issue. How many levels do the voltages and clocks have that they need to transition between, and can they be monitored? It would be interesting to see if it is a specific state transition (which would then happen very rarely) that triggers the issue.

During your 21 hours test, is the device just idle?
How frequent is this issue now?

It was like most other times, I was switching back and forth, sometimes browsing some web pages, sometimes editing a file or two, sometimes compiling something. As per usual, I was switching to the TX2 as it was idling in the background when I discovered that the issue had happened.

Right now the issue has not popped up yet after the 21h run, so we’re at 3 days and counting. This is normal; as I said, it goes days or weeks between instances, with the TX2 running 24/7 (but mostly idle). So I think it would be valuable if I could add some relevant statistics capture in preparation for the next time, if any ideas surface. In any case, it’s never that I start some heavy computation and that triggers the issue.

Hi marcus_c,

Please do the following before the test:

echo 0 > /sys/devices/gpu.0/railgate_enable
cd /sys/devices/gpu.0/devfreq/17000000.gp10b
cat max_freq > min_freq # fix gpu freq to maximum
cat cur_freq # should be the same as max_freq

Then, monitor the values below before and after the error occurs:

sudo -s

cat /sys/kernel/debug/bpmp/debug/clk/gpu/dvfs
cat /sys/kernel/debug/bpmp/debug/clk/clk_tree 
cat /sys/kernel/debug/bpmp/debug/regulator/vdd_gpu/voltage

You can dump the values to a file.

Hi WayneWWW,

Thanks. I’ve started a script which dumps these to a log every minute. I’ll let you know the results after the next time the issue appears.
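In case it’s useful, it’s roughly the following minimal sketch: the debugfs paths are the ones you listed, the log location /var/log/gpu-dvfs.log is just where I chose to put it, and it runs as root.

#!/bin/bash
# Sketch of the logging script: dump the suggested debugfs nodes
# to a log file once a minute, with a timestamp per sample.
LOG=/var/log/gpu-dvfs.log
while true; do
    {
        echo "=== $(date -Is) ==="
        for f in /sys/kernel/debug/bpmp/debug/clk/gpu/dvfs \
                 /sys/kernel/debug/bpmp/debug/clk/clk_tree \
                 /sys/kernel/debug/bpmp/debug/regulator/vdd_gpu/voltage; do
            echo "--- $f ---"
            cat "$f"
        done
    } >> "$LOG" 2>&1
    sleep 60
done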

Hi marcus_c,

Any update?

Hi WayneWWW,

Nope, it’s still trucking along without lockups with an uptime of 9 days 8:38.

I should probably let it run for a few more weeks before declaring that the railgate/min_freq change fixed it completely, though.

But in other news, my work with KVM is coming along nicely; I now have an aarch64_be guest running with full hardware virtualization. :-)

Any update on that error?