Xavier NX becomes very slow after crashing at pmu_pg

I am seeing this crash report in dmesg:

[ 3936.938245] nvgpu: 17000000.gv11b gk20a_gr_isr:5994 [ERR] pgraph intr: 0x00000001, chid: INVALID
[ 3936.946235] nvgpu: 17000000.gv11b nvgpu_pmu_enable_elpg:208 [WRN] nvgpu_pmu_enable_elpg(): possible elpg refcnt mismatch. elpg refcnt=2
[ 3936.959463] ------------[ cut here ]------------
[ 3936.964467] WARNING: CPU: 2 PID: 3118 at /nvidia_sdk/JetPack_4.4_Linux_JETSON_XAVIER_NX/Linux_for_Tegra/sources/kernel/nvgpu/drivers/gpu/nvgpu/common/pmu/pmu_pg.c:209 nvgpu_pmu_enable_elpg+0x22c/0x2d0 [nvgpu]
[ 3936.984348] Modules linked in: bnep fuse cdc_acm uvcvideo zram overlay ar1335(O) 88x2bu(O) cfg80211 spidev nvgpu bluedroid_pm ip_tables x_tables

[ 3936.984402] CPU: 2 PID: 3118 Comm: irq/466-gk20a_s Tainted: G W O 4.9.140+ #5
[ 3936.984406] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[ 3936.984411] task: ffffffc1e0b48e00 task.stack: ffffffc1d953c000
[ 3936.984823] PC is at nvgpu_pmu_enable_elpg+0x22c/0x2d0 [nvgpu]
[ 3936.985237] LR is at nvgpu_pmu_enable_elpg+0x22c/0x2d0 [nvgpu]
[ 3936.985242] pc : [] lr : [] pstate: 40c00045
[ 3936.985245] sp : ffffffc1d953fbe0
[ 3936.985248] x29: ffffffc1d953fbf0 x28: 0000000000000000
[ 3936.985257] x27: 0000000000000000 x26: 0000000000000000
[ 3936.985265] x25: ffffff8001062300 x24: ffffffc1d950b000
[ 3936.985272] x23: ffffffc1d9502d28 x22: ffffff8001096b90
[ 3936.985279] x21: ffffff80010699a8 x20: ffffff8001069a20
[ 3936.985286] x19: ffffffc1d9500000 x18: 0000000000000010
[ 3936.985294] x17: 0000007f784671d0 x16: 0000000000000000
[ 3936.985301] x15: ffffffffffffffff x14: 6374616d73696d20
[ 3936.985309] x13: 746e636665722067 x12: 706c6520656c6269
[ 3936.985316] x11: 73736f70203a2928 x10: 000000000005832f
[ 3936.985324] x9 : 616e655f756d705f x8 : ffffff80083d3788
[ 3936.985333] x7 : ffffff8009ea4198 x6 : ffffffc1ffd3ebf0
[ 3936.985340] x5 : ffffffc1ffd3ebf0 x4 : 0000000000000000
[ 3936.985347] x3 : ffffffc1ffd447f8 x2 : ffffffc1ffd3ebf0
[ 3936.985355] x1 : ffffffc1e0b48e00 x0 : 0000000000000089

[ 3936.985366] ---[ end trace b444cbcff3b3a1ab ]---
[ 3936.989332] Call trace:
[ 3936.989747] [] nvgpu_pmu_enable_elpg+0x22c/0x2d0 [nvgpu]
[ 3936.990157] [] nvgpu_pmu_pg_global_enable+0x100/0x108 [nvgpu]
[ 3936.990562] [] nvgpu_pg_elpg_enable+0xb0/0xc8 [nvgpu]
[ 3936.990961] [] mc_gp10b_isr_stall+0x1dc/0x218 [nvgpu]
[ 3936.991371] [] nvgpu_intr_thread_stall+0x50/0x1d8 [nvgpu]
[ 3936.991775] [] gk20a_intr_thread_stall+0x20/0x30 [nvgpu]
[ 3936.991786] [] irq_thread_fn+0x30/0x80
[ 3936.991792] [] irq_thread+0x11c/0x1a8
[ 3936.991799] [] kthread+0xec/0xf0
[ 3936.991806] [] ret_from_fork+0x10/0x30

After this, my system becomes extremely slow. What does this error mean, and how can I fix it? I am using a Jetson Xavier NX with JetPack 4.4.
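To correlate the slowdown with the warning, it can help to pull the ELPG refcount warnings out of the kernel log together with their timestamps (the bracketed number is seconds since boot). The following is only a minimal sketch: the function name `find_elpg_mismatches` and the regular expression are illustrative assumptions, written against the exact message format seen in the log above, and they may need adjusting for other nvgpu versions.

```python
import re

# Assumed pattern, derived only from the warning line quoted in this thread:
# [ 3936.946235] nvgpu: ... possible elpg refcnt mismatch. elpg refcnt=2
ELPG_WARNING = re.compile(
    r"^\[\s*(?P<seconds>\d+\.\d+)\]\s+nvgpu:.*possible elpg refcnt mismatch\."
    r"\s*elpg refcnt=(?P<refcnt>\d+)"
)


def find_elpg_mismatches(log_text):
    """Return a list of (seconds_since_boot, refcnt) for each ELPG warning."""
    hits = []
    for line in log_text.splitlines():
        match = ELPG_WARNING.search(line)
        if match:
            hits.append((float(match.group("seconds")), int(match.group("refcnt"))))
    return hits


# Two lines copied from the dmesg output in this thread.
sample_log = (
    "[ 3936.938245] nvgpu: 17000000.gv11b gk20a_gr_isr:5994 [ERR] "
    "pgraph intr: 0x00000001, chid: INVALID\n"
    "[ 3936.946235] nvgpu: 17000000.gv11b nvgpu_pmu_enable_elpg:208 [WRN] "
    "nvgpu_pmu_enable_elpg(): possible elpg refcnt mismatch. elpg refcnt=2\n"
)

for seconds, refcnt in find_elpg_mismatches(sample_log):
    # Convert the boot-relative timestamp to minutes for easier reading.
    print(f"{seconds / 60:.1f} min after boot: elpg refcnt={refcnt}")
```

On the device, the same function could be fed the saved output of `dmesg` instead of `sample_log`, which makes it easy to check whether the warning always shows up after a similar run time.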

hello JSP_1,

May I also know the scenario that reproduces this issue?
For example, which power mode you had configured, which process you were running, and how long it ran before you saw the issue?

Hi JerryChang,
I am using the 15W 6-core power mode. I am running a capture application that captures images from a MIPI camera and performs image analysis using the OpenCV CUDA APIs. The issue appears randomly. After a power-cycle reboot the system works fine, and then the issue appears again at random.

What does this error mean? Why is it crashing?

hello JSP_1,

Since an error was reported from the GPU side when the issue reproduced, could you please help narrow it down by confirming that your MIPI camera stream is stable?
Please check the developer guide, Approaches for Validating and Testing the V4L2 Driver; you may run the camera stream on its own to verify its stability.

Hi JerryChang,
Our MIPI camera is streaming properly. Is this problem related to power? Is there any other cause you suspect?

hello JSP_1,

Are you seeing any messages reporting that the system is throttled due to over-current?

Please check a similar discussion thread, Topic 167029 System throttled due to over-current?, for reference.
It seems that running some DeepStream or M/L models can hit that issue, and it impacts performance.

Hi JerryChang,
I am not seeing any throttle messages. Yes, the performance becomes very bad after the crash messages appear in dmesg.

Do you have any work around or fixes? What is the suggested course of action? Any help?

hello JSP_1,

Could you please try configuring the MaxN performance mode when running your use case?
Please check the developer guide, Supported Modes and Power Efficiency, for reference.

We are using mode 2. This is the MaxN mode for Xavier NX, right? The problem occurs every time after running for around 45 minutes, and then a reboot is needed to recover.

hello JSP_1,

Mode 0 is the MaxN performance mode configuration.
It seems to put only two CPUs online, but this mode boosts the CPU frequency to the maximum, 1900MHz.
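On Jetson, the power mode is normally switched with `sudo nvpmodel -m 0` and queried with `nvpmodel -q`. To confirm how many cores a mode actually leaves online, one option is to read the standard Linux sysfs file `/sys/devices/system/cpu/online`, which holds a CPU list such as `0-1` or `0-5`. The sketch below is an illustration only; the helper names `parse_cpu_list` and `online_cpus` are hypothetical, not part of any NVIDIA tool.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-1,4-5' into [0, 1, 4, 5]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def online_cpus(path="/sys/devices/system/cpu/online"):
    """Read the list of online CPUs from sysfs (Linux only)."""
    with open(path) as f:
        return parse_cpu_list(f.read())


# Example strings only; on the device, call online_cpus() instead.
for label, cpu_list in [("two cores online", "0-1"), ("six cores online", "0-5")]:
    print(label, "->", len(parse_cpu_list(cpu_list)), "CPUs")
```

Running `online_cpus()` before and after changing the mode shows whether the test is really running with two cores or with all six.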

Hi Jerry Chang,
I am operating the system in the 6-core 15W mode. Do you want me to try with 2 CPU cores? That will affect overall system performance, right? We have been using all 6 cores, mostly at up to 80% CPU utilization, so if I have to use only 2 cores, I cannot use the Xavier NX for my system. Do you want me to try this only for testing?

hello JSP_1,

Yes, please use mode 0 (MaxN) for test purposes only.
The 6-core mode will have better performance if your code is implemented with multiple threads.