Memory errors on AGX Xavier 64GB with jetpack 4.6.3

utilisateur1559 · October 5, 2023, 10:11am

Hello,

We are seeing errors on the AGX Xavier 64GB. Here are the logs:

[ 4.693191] mc-err: (255) csr_nvl1rhp: EMEM address decode error
[ 4.693223] mc-err: status = 0x200000f5; addr = 0x33487200; hi_adr_reg=008
[ 4.693236] mc-err: secure: no, access-type: read
[ 4.693261] mc-err: (255) csr_nvl2rhp: EMEM address decode error
[ 4.693273] mc-err: status = 0x200000f6; addr = 0x33eb6200; hi_adr_reg=008
[ 4.693284] mc-err: secure: no, access-type: read
[ 4.693301] mc-err: mcerr: unknown intr source intstatus = 0x00000000, intstatus_1 = 0x00000000

[ 4.860839] mc-err: (255) csr_nvl1rhp: EMEM address decode error
[ 4.860870] mc-err: status = 0x200000f5; addr = 0x337e7200; hi_adr_reg=008
[ 4.860882] mc-err: secure: no, access-type: read
[ 4.860898] mc-err: Too many MC errors; throttling prints

[ 14.861373] nvgpu: 17000000.gv11b __nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ nvgpu_flcn_wait_for_halt+0x40/0xa0 [nvgpu]
[ 14.861410] nvgpu: 17000000.gv11b nvgpu_gm20b_acr_wait_for_completion:1059 [ERR] flcn-0: ACR boot timed out
[ 14.861506] nvgpu: 17000000.gv11b gk20a_finalize_poweron:328 [ERR] ACR bootstrap failed
[ 14.861557] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 15.666374] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 16.298532] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 16.927514] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 17.560079] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 18.190518] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set

The full log is attached.

We are using our custom carrier board with JetPack 4.6.3. There are no errors with the AGX Xavier 32GB.

We do not see this issue on JetPack 5.1.1 with either the AGX Xavier 32GB or the 64GB, but right now we need JetPack 4.6.x to be functional.

Do you have any idea what the problem could be?
Thank you
dmesg.txt (55.0 KB)
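
For anyone triaging a similar dmesg, the mc-err entries above can be grouped per memory client with a few lines of Python. This is only a minimal sketch: the regular expressions are assumptions derived from the log lines quoted in this post, and the function names (`parse_mc_errors`, `count_by_client`) are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt in this post.
SAMPLE = """\
[ 4.693191] mc-err: (255) csr_nvl1rhp: EMEM address decode error
[ 4.693223] mc-err: status = 0x200000f5; addr = 0x33487200; hi_adr_reg=008
[ 4.693236] mc-err: secure: no, access-type: read
[ 4.693261] mc-err: (255) csr_nvl2rhp: EMEM address decode error
[ 4.693273] mc-err: status = 0x200000f6; addr = 0x33eb6200; hi_adr_reg=008
[ 4.693284] mc-err: secure: no, access-type: read
"""

# Assumed formats, inferred only from the lines above.
CLIENT_RE = re.compile(r"mc-err: \((\d+)\) (\w+): (.+)$")
ADDR_RE = re.compile(
    r"mc-err: status = (0x[0-9a-f]+); addr = (0x[0-9a-f]+); hi_adr_reg=([0-9a-f]+)"
)


def parse_mc_errors(text):
    """Return one dict per mc-err event (client, error, status, addr, hi_adr_reg)."""
    events = []
    current = None
    for line in text.splitlines():
        m = CLIENT_RE.search(line)
        if m:
            # A "(id) client: message" line starts a new event.
            current = {"client": m.group(2), "error": m.group(3).strip()}
            events.append(current)
            continue
        m = ADDR_RE.search(line)
        if m and current is not None:
            # The status/addr line belongs to the most recent event.
            current["status"] = m.group(1)
            current["addr"] = m.group(2)
            current["hi_adr_reg"] = m.group(3)
    return events


def count_by_client(events):
    """Count events per memory client name."""
    return Counter(e["client"] for e in events)


if __name__ == "__main__":
    for event in parse_mc_errors(SAMPLE):
        print(event)
```

Running it on the full attached dmesg instead of `SAMPLE` would show whether every error comes from the same clients (here `csr_nvl1rhp` and `csr_nvl2rhp`).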

linuxdev · October 5, 2023, 12:42pm

I doubt I can answer this, but a question comes to mind: is this the official developer kit, or is it a module mounted on a third-party carrier board? I ask because the log seems related to loading the SPI driver, and if the device tree is not correct, there is no telling what will happen when the driver loads. Also, the device tree likely differs between the 32 GB and 64 GB models, so if you simply copied from one to the other, part of the device tree might be incorrect. If you can confirm that the correct software was flashed (the right flash target in combination with the right device tree, which is automatically correct for dev kits but different for third-party carrier boards), then it becomes a software issue in need of debugging rather than just a question of which release is used.

utilisateur1559 · October 5, 2023, 1:22pm

Hello, thanks for the answer.

As I said in my first message, it’s on our custom carrier board, so yes, it’s a module! And I think the AGX Xavier 64GB does not have a devkit.
It has nothing to do with SPI; that is just a warning. I removed the spidev driver and I still get the nvgpu and mc errors. I attached the new dmesg in case it is needed.

Yes, I am using the same image for the AGX Xavier 32GB and 64GB. It looks like the Linux kernel uses the same device tree for both; the NVIDIA flashing tool detects which module is being flashed and uses the right DTB for the bootloader, which sets the right amount of memory for the kernel. Please let me know if I am wrong. The amount of memory reported in /proc/meminfo is correct for both the AGX Xavier 32GB and 64GB.

As I said, on JetPack 5.1.1 both the AGX Xavier 32GB and 64GB work well using the same image on our carrier board, which confirms that it should not be a device tree issue, at least not one caused by our modifications for our custom carrier board.

Thank you
dmesg.txt (52.4 KB)
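
The /proc/meminfo check mentioned in this post can be scripted so the same test runs on both module variants. A minimal sketch, assuming the usual `MemTotal: <value> kB` line format; the function names are hypothetical.

```python
def mem_total_gib(meminfo_text):
    """Extract MemTotal from /proc/meminfo-style text and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # second field is the size in kB
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def read_mem_total_gib(path="/proc/meminfo"):
    """Read the live value on a Linux system."""
    with open(path) as f:
        return mem_total_gib(f.read())
```

On the target, `read_mem_total_gib()` returns the total the kernel sees. The value is typically somewhat below the nominal module size because part of the memory is reserved.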

utilisateur1559 · October 6, 2023, 2:23pm

Hello,
I just updated the system to the latest JetPack 4.6.4.
We still have the issue. Here are the logs:

[ 5.563491] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 6.706648] irq 84: nobody cared (try booting with the "irqpoll" option)
[ 6.707198] CPU: 0 PID: 3140 Comm: nvpmodel Not tainted 4.9.337-l4t-r32.7.4+g03dc005d9a50 #1
[ 6.707991] Call trace:
[ 6.708373] [<00000000d59a827e>] dump_backtrace+0x0/0x178
[ 6.708801] [<000000006082020c>] show_stack+0x24/0x30
[ 6.709242] [<0000000094c8b723>] dump_stack+0xa4/0xc8
[ 6.709596] [<0000000091526135>] __report_bad_irq+0x54/0xe0
[ 6.709986] [<00000000c47236b9>] note_interrupt+0x280/0x2e8
[ 6.710310] [<00000000362e7049>] handle_irq_event_percpu+0x5c/0x70
[ 6.710630] [<00000000414caf65>] handle_irq_event+0x50/0x80
[ 6.710950] [<00000000c928ef9f>] handle_fasteoi_irq+0xd0/0x1b0
[ 6.711307] [<000000002707fe73>] generic_handle_irq+0x34/0x50
[ 6.711625] [<00000000c49cba50>] __handle_domain_irq+0x6c/0xc0
[ 6.711940] [<00000000ecbaac10>] gic_handle_irq+0x54/0xa8
[ 6.712318] [<0000000007801459>] el1_irq+0xe8/0x194
[ 6.712661] [<000000006af4046d>] irq_exit+0xd4/0x110
[ 6.713055] [<0000000019af5326>] __handle_domain_irq+0x70/0xc0
[ 6.713482] [<00000000ecbaac10>] gic_handle_irq+0x54/0xa8
[ 6.713774] [<0000000007801459>] el1_irq+0xe8/0x194
[ 6.719030] [<0000000098edfaf9>] __nvgpu_readl+0x0/0xb8 [nvgpu]
[ 6.724159] [<00000000ea6369b4>] gm20b_bus_bar1_bind+0xc4/0x110 [nvgpu]
[ 6.729505] [<00000000c79c9767>] gk20a_init_mm_setup_hw+0x8c/0x118 [nvgpu]
[ 6.734911] [<00000000c3b78ba5>] gv11b_init_mm_setup_hw+0x4c/0x1b8 [nvgpu]
[ 6.740229] [<000000001a75c725>] nvgpu_init_mm_support+0x90/0xa0 [nvgpu]
[ 6.745765] [<0000000016886075>] gk20a_finalize_poweron+0x4ac/0x930 [nvgpu]
[ 6.751072] [<00000000c51c8002>] gk20a_pm_finalize_poweron+0xe4/0x3f0 [nvgpu]
[ 6.756344] [<000000009700880d>] gk20a_pm_runtime_resume+0x3c/0x70 [nvgpu]
[ 6.756710] [<0000000070849330>] pm_generic_runtime_resume+0x3c/0x58
[ 6.757062] [<000000000cfa3653>] __rpm_callback+0x74/0xa0
[ 6.757356] [<000000008cfd6936>] rpm_callback+0x34/0x98
[ 6.757647] [<00000000a8b8212b>] rpm_resume+0x58c/0x748
[ 6.758017] [<00000000b87d2c42>] pm_runtime_forbid+0x50/0x68
[ 6.758331] [<0000000055113db5>] control_store+0xac/0x100
[ 6.758625] [<00000000d62f70fd>] dev_attr_store+0x44/0x60
[ 6.758930] [<00000000cae85ead>] sysfs_kf_write+0x5c/0x70
[ 6.759254] [<000000003ccb5571>] kernfs_fop_write+0xc0/0x1d8
[ 6.759544] [<00000000ac1ea057>] __vfs_write+0x48/0x128
[ 6.759850] [<00000000ec49f52d>] vfs_write+0xac/0x1b0
[ 6.760288] [<00000000be871c88>] SyS_write+0x5c/0xc8
[ 6.760617] [<00000000f652b6eb>] __sys_trace_return+0x0/0x4
[ 6.760904] handlers:
[ 6.761195] [<000000004601bb44>] tegra_mcerr_hard_irq threaded [<00000000be662190>] tegra_mcerr_thread
[ 6.761932] Disabling IRQ #84
[ 6.763195] mc-err: (255) csr_nvl2rhp: EMEM address decode error
[ 6.763263] mc-err: status = 0x200000f6; addr = 0x2d996200; hi_adr_reg=008
[ 6.763358] mc-err: secure: no, access-type: read
[ 7.102959] net eth0: get_configure_l3v4_filter →
[ 7.103544] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 15.241560] audit: type=1006 audit(1600598649.300:2): pid=3187 uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=1 res=1
[ 16.786551] nvgpu: 17000000.gv11b __nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ nvgpu_flcn_wait_for_halt+0x40/0xa0 [nvgpu]
[ 16.786601] nvgpu: 17000000.gv11b nvgpu_gm20b_acr_wait_for_completion:1059 [ERR] flcn-0: ACR boot timed out
[ 16.786698] nvgpu: 17000000.gv11b gk20a_finalize_poweron:328 [ERR] ACR bootstrap failed
[ 17.535999] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 18.275885] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 19.140325] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 19.896886] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 20.739703] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set

Please find the full log attached.
Please let me know if you have any idea on what could be the problem and how to fix it.

Thanks
dmesg-4.6.4.txt (54.6 KB)

linuxdev · October 6, 2023, 8:56pm

I can’t give you a specific answer. I will still suggest that there is a strong possibility that this is related to device tree, although it is not guaranteed.

Some random information about that stack dump:

This is from an IRQ handler.
An interrupt has resulted in trying to relate a driver to that IRQ.
For a number of possible reasons, there has been a failure to associate this hardware IRQ with the particular hardware and driver. Quite often, for a non-plug-n-play device which cannot self-report, this is the function of the device tree.
This is related to the hardware which the gk20a driver runs, but the gk20a itself has been found and is in turn triggering the next hardware in the chain, including the nvgpu; this too is not the failure point, it is a chain of hardware.
Finally, the exception level 1 (kernel mode, where drivers live, is el1) reaches for hardware in the chain of hardware interrupts. The IRQ is not handled. I can’t tell you which hardware it is looking for because the kernel itself does not know. This is why it is stack dumping.

Someone with better knowledge of those specific drivers can probably suggest what kind of hardware it is looking for. It is extremely likely that something in your carrier board lane routing has changed, or a hardware setup address has changed, and that as a result the device tree has a fragment which is no longer valid.
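
One detail can be read directly from the posted log: the `handlers:` section of the "nobody cared" report lists the functions registered for IRQ 84, and they are `tegra_mcerr_hard_irq` and `tegra_mcerr_thread`, which appears to tie this interrupt to the memory controller error reporting seen in the mc-err lines. A minimal Python sketch for pulling that pairing out of a dmesg follows; the regular expressions are assumptions based only on the lines quoted in this thread, and `unhandled_irqs` is a hypothetical helper name.

```python
import re

# Lines copied from the JetPack 4.6.4 log in this thread (call trace omitted).
SAMPLE = """\
[ 6.706648] irq 84: nobody cared (try booting with the "irqpoll" option)
[ 6.760904] handlers:
[ 6.761195] [<000000004601bb44>] tegra_mcerr_hard_irq threaded [<00000000be662190>] tegra_mcerr_thread
[ 6.761932] Disabling IRQ #84
"""

IRQ_RE = re.compile(r"irq (\d+): nobody cared")
# Matches "[<address>] symbol"; the "[ 6.76...]" timestamp has no "<" and is skipped.
HANDLER_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)")


def unhandled_irqs(text):
    """Map each 'nobody cared' IRQ number to the handler names listed for it."""
    result = {}
    irq = None
    in_handlers = False
    for line in text.splitlines():
        m = IRQ_RE.search(line)
        if m:
            irq = int(m.group(1))
            result[irq] = []
            in_handlers = False
            continue
        if irq is None:
            continue
        if line.rstrip().endswith("handlers:"):
            in_handlers = True  # following lines list the registered handlers
            continue
        if "Disabling IRQ" in line:
            irq = None  # end of this report
            in_handlers = False
            continue
        if in_handlers:
            result[irq].extend(HANDLER_RE.findall(line))
    return result


if __name__ == "__main__":
    print(unhandled_irqs(SAMPLE))
```

Call-trace lines between the "nobody cared" line and `handlers:` are ignored on purpose, so only the registered handlers are collected.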

utilisateur1559 · October 9, 2023, 9:02am

Hello,
Is there a real person with better knowledge available to answer this question, as linuxdev suggested, please?
Thank you

WayneWWW · October 9, 2023, 9:18am

Are you able to test this on the NV devkit instead of your board?

We don’t know you or your board. It would be better to first clarify on the NV devkit whether this is a defective module.

utilisateur1559 · October 9, 2023, 4:41pm

Hello,
The module is not defective: there is no issue with JetPack 5 on our custom board.

I did the test on nvidia devkit with a 64GB module and jetpack 4.6.4, there was no issue.

I repeated the test on our board with the 64GB module and JetPack 4.6.4:

I removed all modules except nvgpu.ko → the issue is still there
I removed nvgpu.ko → the issue is gone
If I load nvgpu.ko manually after boot → no issue, as you can see in the attached logs

Please let me know if you have any idea on how to fix this.
Thank you
dmesg-manual-load.txt (48.5 KB)

WayneWWW · October 10, 2023, 4:47am

"I did the test on nvidia devkit with a 64GB module and jetpack 4.6.4, there was no issue."

Then could you clarify what the software difference is between running on the devkit and running on your custom board?

utilisateur1559 · October 16, 2023, 3:18pm

Hello,
We solved the problem, thank you. There was an issue with our custom DTS: some parts of plugin-manager were missing (gpu-64gb-disable-l3).
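
For others porting a custom carrier board, a quick way to spot this kind of omission is to search a decompiled device tree (for example, one produced with `dtc -I dtb -O dts`) for the fragment name. A minimal sketch in Python: `find_fragment_lines` is a hypothetical helper, and `SAMPLE_DTS` is a made-up placeholder layout used only to exercise the search, not the real NVIDIA fragment.

```python
# Placeholder text for illustration only; the real fragment's contents and
# exact position inside plugin-manager are NOT shown in this thread.
SAMPLE_DTS = """\
/ {
    plugin-manager {
        gpu-64gb-disable-l3 {
            /* contents omitted: hypothetical placeholder */
        };
    };
};
"""


def find_fragment_lines(dts_text, needle):
    """Return (line_number, stripped_line) for every line containing needle."""
    return [
        (number, line.strip())
        for number, line in enumerate(dts_text.splitlines(), 1)
        if needle in line
    ]


if __name__ == "__main__":
    hits = find_fragment_lines(SAMPLE_DTS, "gpu-64gb-disable-l3")
    if hits:
        for number, line in hits:
            print(f"line {number}: {line}")
    else:
        print("fragment name not found; compare against the reference DTS")
```

Running the same search on the devkit's device tree and on the custom one gives a simple presence check. A plain substring match is used deliberately so that no assumption is made about how the fragment is structured.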
