Memory errors on AGX Xavier 64GB with JetPack 4.6.3

Hello,

We are seeing errors on the AGX Xavier 64GB. Here are the logs:

[ 4.693191] mc-err: (255) csr_nvl1rhp: EMEM address decode error
[ 4.693223] mc-err: status = 0x200000f5; addr = 0x33487200; hi_adr_reg=008
[ 4.693236] mc-err: secure: no, access-type: read
[ 4.693261] mc-err: (255) csr_nvl2rhp: EMEM address decode error
[ 4.693273] mc-err: status = 0x200000f6; addr = 0x33eb6200; hi_adr_reg=008
[ 4.693284] mc-err: secure: no, access-type: read
[ 4.693301] mc-err: mcerr: unknown intr source intstatus = 0x00000000, intstatus_1 = 0x00000000

[ 4.860839] mc-err: (255) csr_nvl1rhp: EMEM address decode error
[ 4.860870] mc-err: status = 0x200000f5; addr = 0x337e7200; hi_adr_reg=008
[ 4.860882] mc-err: secure: no, access-type: read
[ 4.860898] mc-err: Too many MC errors; throttling prints

[ 14.861373] nvgpu: 17000000.gv11b __nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ nvgpu_flcn_wait_for_halt+0x40/0xa0 [nvgpu]
[ 14.861410] nvgpu: 17000000.gv11b nvgpu_gm20b_acr_wait_for_completion:1059 [ERR] flcn-0: ACR boot timed out
[ 14.861506] nvgpu: 17000000.gv11b gk20a_finalize_poweron:328 [ERR] ACR bootstrap failed
[ 14.861557] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 15.666374] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 16.298532] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 16.927514] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 17.560079] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 18.190518] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set

The full log is attached.

We are using our custom board with JetPack 4.6.3. There are no errors with the AGX Xavier 32GB.

We do not see this issue on JetPack 5.1.1 with either the AGX Xavier 32GB or the 64GB, but right now we need JetPack 4.6.x to be functional.

Do you have any idea what the problem could be?
Thank you
dmesg.txt (55.0 KB)
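For anyone triaging a similar log, the mc-err records above can be grouped per memory client with a few lines of Python. This is only a minimal sketch: the function name `parse_mc_errors` is made up for illustration, and the regular expressions assume the exact line layout shown in this excerpt, not a documented kernel format.

```python
import re

# Patterns assume the mc-err layout seen in the dmesg excerpt above.
CLIENT_RE = re.compile(r"mc-err: \(\d+\) (\w+): (.+)$")
STATUS_RE = re.compile(
    r"mc-err: status = (0x[0-9a-fA-F]+); "
    r"addr = (0x[0-9a-fA-F]+); hi_adr_reg=([0-9a-fA-F]+)"
)

def parse_mc_errors(dmesg_text):
    """Return one dict per mc-err record (hypothetical helper)."""
    records = []
    current = None
    for line in dmesg_text.splitlines():
        m = CLIENT_RE.search(line)
        if m:
            # First line of a record: client name and error description.
            current = {"client": m.group(1), "error": m.group(2).strip()}
            records.append(current)
            continue
        m = STATUS_RE.search(line)
        if m and current is not None:
            # Second line of a record: raw register values, kept as-is.
            current["status"] = int(m.group(1), 16)
            current["addr"] = int(m.group(2), 16)
            current["hi_adr_reg"] = m.group(3)
    return records
```

Feeding it the contents of the attached dmesg.txt gives a list that can be counted or sorted by `client`. The sketch keeps `addr` and `hi_adr_reg` separate because how they combine into a full physical address is not stated in this thread.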

I doubt I can answer this, but a question comes to mind: is this the official developer’s kit, or is it a module mounted on a third-party carrier board? I ask because it is related to loading the SPI driver, and if the device tree is not correct, there is no telling what will happen when the driver loads. There would also likely be a device tree difference between the 32 GB and 64 GB models, so if you’ve simply copied from one to the other, a subset of the device tree may be incorrect. If you can confirm that the correct software was flashed (the right flash target in combination with the right device tree, which is automatically correct for dev kits but differs for third-party carrier boards), then it becomes a software issue in need of debugging rather than just a question of which release is used.

Hello, thanks for the answer.

As I said in my first message, it’s on our custom carrier board, so yes, it’s a module! And I think the AGX Xavier 64GB does not have a devkit.
It has nothing to do with SPI; that is just a warning. I removed the spidev driver and I still have the nvgpu and mc errors. I attached the new dmesg if needed.

Yes, I am using the same image for the AGX Xavier 32GB and 64GB. It looks like the Linux kernel uses the same device tree for both. The NVIDIA flashing tool detects which module is being flashed and uses the right DTB for the bootloader, which sets the right amount of memory for the kernel. Please let me know if I am wrong. The amount of memory reported in /proc/meminfo is right for both the AGX Xavier 32GB and 64GB.
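The /proc/meminfo check mentioned above can be scripted so the same test runs on both modules. A minimal sketch, assuming the usual `MemTotal:   <n> kB` line; the helper name `mem_total_gib` is hypothetical.

```python
def mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The value is reported in kB (KiB); convert to GiB.
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal line not found")
```

On the target this would be called as `mem_total_gib(open("/proc/meminfo").read())`. The reported total is normally somewhat below the nominal module size, so compare it against a known-good unit of the same type rather than against exactly 32 or 64.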

As I said, on JetPack 5.1.1 both the AGX Xavier 32GB and 64GB work well using the same image on our carrier board. That confirms it should not be a device tree issue, at least not one caused by our modifications for our custom carrier board.

Thank you
dmesg.txt (52.4 KB)

Hello,
I just updated the system to the latest JetPack 4.6.4.
We still have the issue. Here are the logs:

[ 5.563491] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 6.706648] irq 84: nobody cared (try booting with the "irqpoll" option)
[ 6.707198] CPU: 0 PID: 3140 Comm: nvpmodel Not tainted 4.9.337-l4t-r32.7.4+g03dc005d9a50 #1
[ 6.707991] Call trace:
[ 6.708373] [<00000000d59a827e>] dump_backtrace+0x0/0x178
[ 6.708801] [<000000006082020c>] show_stack+0x24/0x30
[ 6.709242] [<0000000094c8b723>] dump_stack+0xa4/0xc8
[ 6.709596] [<0000000091526135>] __report_bad_irq+0x54/0xe0
[ 6.709986] [<00000000c47236b9>] note_interrupt+0x280/0x2e8
[ 6.710310] [<00000000362e7049>] handle_irq_event_percpu+0x5c/0x70
[ 6.710630] [<00000000414caf65>] handle_irq_event+0x50/0x80
[ 6.710950] [<00000000c928ef9f>] handle_fasteoi_irq+0xd0/0x1b0
[ 6.711307] [<000000002707fe73>] generic_handle_irq+0x34/0x50
[ 6.711625] [<00000000c49cba50>] __handle_domain_irq+0x6c/0xc0
[ 6.711940] [<00000000ecbaac10>] gic_handle_irq+0x54/0xa8
[ 6.712318] [<0000000007801459>] el1_irq+0xe8/0x194
[ 6.712661] [<000000006af4046d>] irq_exit+0xd4/0x110
[ 6.713055] [<0000000019af5326>] __handle_domain_irq+0x70/0xc0
[ 6.713482] [<00000000ecbaac10>] gic_handle_irq+0x54/0xa8
[ 6.713774] [<0000000007801459>] el1_irq+0xe8/0x194
[ 6.719030] [<0000000098edfaf9>] __nvgpu_readl+0x0/0xb8 [nvgpu]
[ 6.724159] [<00000000ea6369b4>] gm20b_bus_bar1_bind+0xc4/0x110 [nvgpu]
[ 6.729505] [<00000000c79c9767>] gk20a_init_mm_setup_hw+0x8c/0x118 [nvgpu]
[ 6.734911] [<00000000c3b78ba5>] gv11b_init_mm_setup_hw+0x4c/0x1b8 [nvgpu]
[ 6.740229] [<000000001a75c725>] nvgpu_init_mm_support+0x90/0xa0 [nvgpu]
[ 6.745765] [<0000000016886075>] gk20a_finalize_poweron+0x4ac/0x930 [nvgpu]
[ 6.751072] [<00000000c51c8002>] gk20a_pm_finalize_poweron+0xe4/0x3f0 [nvgpu]
[ 6.756344] [<000000009700880d>] gk20a_pm_runtime_resume+0x3c/0x70 [nvgpu]
[ 6.756710] [<0000000070849330>] pm_generic_runtime_resume+0x3c/0x58
[ 6.757062] [<000000000cfa3653>] __rpm_callback+0x74/0xa0
[ 6.757356] [<000000008cfd6936>] rpm_callback+0x34/0x98
[ 6.757647] [<00000000a8b8212b>] rpm_resume+0x58c/0x748
[ 6.758017] [<00000000b87d2c42>] pm_runtime_forbid+0x50/0x68
[ 6.758331] [<0000000055113db5>] control_store+0xac/0x100
[ 6.758625] [<00000000d62f70fd>] dev_attr_store+0x44/0x60
[ 6.758930] [<00000000cae85ead>] sysfs_kf_write+0x5c/0x70
[ 6.759254] [<000000003ccb5571>] kernfs_fop_write+0xc0/0x1d8
[ 6.759544] [<00000000ac1ea057>] __vfs_write+0x48/0x128
[ 6.759850] [<00000000ec49f52d>] vfs_write+0xac/0x1b0
[ 6.760288] [<00000000be871c88>] SyS_write+0x5c/0xc8
[ 6.760617] [<00000000f652b6eb>] __sys_trace_return+0x0/0x4
[ 6.760904] handlers:
[ 6.761195] [<000000004601bb44>] tegra_mcerr_hard_irq threaded [<00000000be662190>] tegra_mcerr_thread
[ 6.761932] Disabling IRQ #84
[ 6.763195] mc-err: (255) csr_nvl2rhp: EMEM address decode error
[ 6.763263] mc-err: status = 0x200000f6; addr = 0x2d996200; hi_adr_reg=008
[ 6.763358] mc-err: secure: no, access-type: read
[ 7.102959] net eth0: get_configure_l3v4_filter -->
[ 7.103544] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 15.241560] audit: type=1006 audit(1600598649.300:2): pid=3187 uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=1 res=1
[ 16.786551] nvgpu: 17000000.gv11b __nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ nvgpu_flcn_wait_for_halt+0x40/0xa0 [nvgpu]
[ 16.786601] nvgpu: 17000000.gv11b nvgpu_gm20b_acr_wait_for_completion:1059 [ERR] flcn-0: ACR boot timed out
[ 16.786698] nvgpu: 17000000.gv11b gk20a_finalize_poweron:328 [ERR] ACR bootstrap failed
[ 17.535999] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 18.275885] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 19.140325] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 19.896886] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set
[ 20.739703] nvgpu: 17000000.gv11b tpc_pg_mask_store:843 [INFO] no value change, same mask already set

Please find the full log attached.
Please let me know if you have any idea what the problem could be and how to fix it.

Thanks
dmesg-4.6.4.txt (54.6 KB)
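In the "nobody cared" report above, the `handlers:` section lists `tegra_mcerr_hard_irq` (with thread function `tegra_mcerr_thread`) as the handler registered on IRQ 84. Pulling that out of a long dmesg by hand is tedious, so here is a small sketch that extracts it. The function name is hypothetical and the patterns only assume the layout visible in this excerpt.

```python
import re

# Layout assumed from the excerpt above:
#   irq <n>: nobody cared ...
#   handlers:
#   [<addr>] <hard_handler> threaded [<addr>] <thread_fn>
IRQ_RE = re.compile(r"irq (\d+): nobody cared")
HANDLER_RE = re.compile(
    r"\[<[0-9a-f]+>\] (\w+)(?: threaded \[<[0-9a-f]+>\] (\w+))?\s*$"
)

def unhandled_irq_handlers(dmesg_text):
    """Map each 'nobody cared' IRQ number to its registered handlers."""
    result = {}
    irq = None
    in_handlers = False
    for line in dmesg_text.splitlines():
        m = IRQ_RE.search(line)
        if m:
            irq = int(m.group(1))
            result[irq] = []
            in_handlers = False
            continue
        if irq is None:
            continue
        if line.rstrip().endswith("handlers:"):
            in_handlers = True
            continue
        if in_handlers:
            m = HANDLER_RE.search(line)
            if m:
                # group(2) is None when the handler has no thread function.
                result[irq].append((m.group(1), m.group(2)))
            else:
                # First non-handler line ends the report.
                in_handlers = False
                irq = None
    return result
```

Call-trace lines between the `irq N: nobody cared` line and `handlers:` are skipped, so only the registered handlers are returned.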

I can’t give you a specific answer. I still think there is a strong possibility that this is related to the device tree, although that is not guaranteed.

Some random information about that stack dump:

  • This is from an IRQ handler.
  • An interrupt has resulted in trying to relate a driver to that IRQ.
  • For a number of possible reasons, there has been a failure to associate this hardware IRQ with the particular hardware and driver. Quite often, for a non-plug-n-play device which cannot self-report, this is the function of the device tree.
  • This is related to the hardware which the gk20a driver runs, but the gk20a itself has been found and is in turn triggering the next hardware in the chain, including the nvgpu; this too is not the failure point, it is a chain of hardware.
  • Finally, exception level 1 (EL1, the kernel mode where drivers live) reaches for hardware in the chain of hardware interrupts. The IRQ is not handled. I can’t tell you which hardware it is looking for because the kernel itself does not know. This is why it is stack dumping.

Someone with better knowledge of those specific drivers can probably suggest what kind of hardware it is looking for. It is extremely likely that something in your carrier board lane routing has changed, or a hardware setup address has changed, and that as a result the device tree has a fragment which is no longer valid.

Hello,
Is there a real person with better knowledge available to answer this question, as linuxdev suggested, please?
Thank you

Are you able to test this on the NV devkit instead of your board?

We don’t know you or your board. It would be better to first check on the NV devkit whether this is a defective module.

Hello,
The module is not defective; there is no issue with JetPack 5 on our custom board.

I ran the test on the NVIDIA devkit with a 64GB module and JetPack 4.6.4, and there was no issue.

I ran the test again on our board with the 64GB module and JetPack 4.6.4:

  • I removed all modules except nvgpu.ko → the issue is still there
  • I removed nvgpu.ko → the issue is gone
  • If I load nvgpu.ko manually after boot → no issue, as you can see in the attached logs

Please let me know if you have any idea how to fix this.
Thank you
dmesg-manual-load.txt (48.5 KB)
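To compare the boot-time log with the manual-load log from the test above, the nvgpu `[ERR]` lines can be extracted from each and diffed. A rough sketch only: both function names are hypothetical, and the pattern assumes the nvgpu line layout seen in this thread.

```python
import re

# Assumes the nvgpu line layout seen in this thread:
#   nvgpu: <device> <function>:<line> [ERR] <message>
NVGPU_ERR_RE = re.compile(r"nvgpu: \S+ (\w+):(\d+) \[ERR\] (.*)$")

def nvgpu_errors(dmesg_text):
    """Return (function, message) pairs for every nvgpu [ERR] line."""
    found = []
    for line in dmesg_text.splitlines():
        m = NVGPU_ERR_RE.search(line)
        if m:
            found.append((m.group(1), m.group(3).strip()))
    return found

def errors_only_in(first_log, second_log):
    """nvgpu errors present in first_log but absent from second_log."""
    second = set(nvgpu_errors(second_log))
    return [e for e in nvgpu_errors(first_log) if e not in second]
```

Calling `errors_only_in(boot_log, manual_load_log)` on the two attached files lists the errors that only appear when nvgpu.ko is loaded during boot. `[INFO]` lines such as the `tpc_pg_mask_store` messages are ignored.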

"I did the test on nvidia devkit with a 64GB module and jetpack 4.6.4, there was no issue."

Then could you clarify what the software difference is between running on the devkit and running on your custom board?

Hello,
We solved the problem, thank you. There was an issue with our custom DTS: some parts of plugin-manager were missing (gpu-64gb-disable-l3).
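For anyone hitting the same thing, one way to spot a missing fragment like this is to decompile both device trees with `dtc -I dtb -O dts` and compare the node names. The sketch below does that comparison on the resulting text. It is only an illustration: the function names are hypothetical, and it assumes the fragment shows up as a node name in the decompiled DTS, which this thread does not confirm.

```python
import re

# Matches "name {" or "label: name@addr {" at the start of a DTS line.
NODE_RE = re.compile(r"^\s*(?:[\w-]+:\s*)?([\w,.+@-]+)\s*\{", re.MULTILINE)

def node_names(dts_text):
    """Return the set of node names found in DTS source text."""
    return set(NODE_RE.findall(dts_text))

def missing_nodes(reference_dts, custom_dts):
    """Node names present in the reference DTS but not in the custom one."""
    return sorted(node_names(reference_dts) - node_names(custom_dts))
```

With the devkit DTS as `reference_dts` and the carrier-board DTS as `custom_dts`, a name like `gpu-64gb-disable-l3` would appear in the result if it were missing from the custom tree. The comparison is by name only, so it will not catch a node that exists but has different contents.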
