Xavier SOM can't boot normally

I have two SOMs for debugging our custom board.
With the same image, the two SOMs behave differently: one boots normally and the other reboots repeatedly.
I have reflashed the bad one several times with the same result. The bad one does boot sometimes, but its log differs from the good one's. The main log is:

[   23.381679] INFO: rcu_preempt self-detected stall on CPU
[   23.381832] 	0-...: (1 GPs behind) idle=5d3/140000000000002/0 softirq=280/291 fqs=2107 
[   23.381978] 	 (t=5250 jiffies g=-157 c=-158 q=814)
[   48.565686] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:2:1168]
[   48.566118] Kernel panic - not syncing: softlockup: hung tasks
[   48.566226] CPU: 0 PID: 1168 Comm: kworker/0:2 Tainted: G             L  4.9.140-tegra #1
[   48.566370] Hardware name: Jetson-AGX (DT)
[   48.566451] Workqueue: events sdhci_delayed_detect
[   48.566541] Call trace:
[   48.566596] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
[   48.566690] [<ffffff800808c37c>] show_stack+0x24/0x30
[   48.566796] [<ffffff800845d820>] dump_stack+0x98/0xc0
[   48.566888] [<ffffff80081c2198>] panic+0x11c/0x298
[   48.566980] [<ffffff80081824b0>] watchdog_unpark_threads+0x0/0x98
[   48.567086] [<ffffff800813a738>] __hrtimer_run_queues+0xd8/0x360
[   48.567202] [<ffffff800813b088>] hrtimer_interrupt+0xa8/0x1e0
[   48.567300] [<ffffff8008bf52e8>] arch_timer_handler_phys+0x38/0x58
[   48.567692] [<ffffff8008127c68>] handle_percpu_devid_irq+0x90/0x2b0
[   48.568163] [<ffffff800812224c>] generic_handle_irq+0x34/0x50
[   48.568606] [<ffffff8008122930>] __handle_domain_irq+0x68/0xc0
[   48.570728] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[   48.575890] [<ffffff8008082be8>] el1_irq+0xe8/0x18c
[   48.580533] [<ffffff80080bb298>] irq_exit+0xd0/0x118
[   48.585605] [<ffffff8008122934>] __handle_domain_irq+0x6c/0xc0
[   48.591465] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[   48.596809] [<ffffff8008082be8>] el1_irq+0xe8/0x18c
[   48.601615] [<ffffff8008f59468>] _raw_spin_unlock_irqrestore+0x30/0x60
[   48.608178] [<ffffff80080eaa08>] try_to_wake_up+0x208/0x3a8
[   48.613693] [<ffffff80080eabd0>] wake_up_process+0x28/0x38
[   48.619381] [<ffffff8008125718>] __setup_irq+0x438/0x5c8
[   48.624891] [<ffffff8008125a98>] request_threaded_irq+0xf8/0x1c0
[   48.630926] [<ffffff80081286f4>] devm_request_threaded_irq+0x7c/0xe0
[   48.637316] [<ffffff8008bb392c>] mmc_gpiod_request_cd_irq+0x94/0xc8
[   48.643527] [<ffffff8008ba6350>] mmc_start_host+0x68/0xc0
[   48.648696] [<ffffff8008ba7a4c>] mmc_add_host+0x6c/0xb8
[   48.654202] [<ffffff8008bbadbc>] __sdhci_add_host+0x154/0x3f8
[   48.659807] [<ffffff8008bbce7c>] sdhci_add_host+0x2c/0x38
[   48.665245] [<ffffff8008bc27b0>] sdhci_delayed_detect+0x28/0x1f8
[   48.671190] [<ffffff80080d4f3c>] process_one_work+0x1e4/0x4b0
[   48.676965] [<ffffff80080d5258>] worker_thread+0x50/0x4c8
[   48.682289] [<ffffff80080dbee4>] kthread+0xec/0xf0
[   48.687276] [<ffffff8008083850>] ret_from_fork+0x10/0x40
[   48.692617] SMP: stopping secondary CPUs
[   48.696303] Kernel Offset: disabled
[   48.700223] Memory Limit: none
[   48.703374] trusty-log panic notifier - trusty version Built: 22:43:40 Dec  9 2019 [   48.717008] Rebooting in 5 seconds..

The full logs are attached.
RightSOM.log (225.1 KB)

BadSOM.log (116.0 KB)

@kayccc

Is this an unmodified image from sdkmanager?

Is this SOM working fine on devkit?

Hi WayneWWW,

I have made some device tree changes for our board.
Also, I can’t put the Xavier SOM on the devkit. The devkit is a complete unit with a fan, and I can’t separate it.

Hi,

You could just remove the fan and replace the module. We need to know whether this is a defective module or whether we should investigate the custom carrier board.

Hi,

Thank you for your reply. I will try it.

Hi WayneWWW,

I tested the SOM on the devkit, and it works fine.
If the problem is in our custom carrier board, why does the other SOM work?
Could you tell me anything based on the logs from the two SOMs?
Thank you.

Do you have a working SD card on the board? The error you pasted points to the sdhci driver:

[ 48.566451] Workqueue: events sdhci_delayed_detect

Also, I don’t see any error in your “BadSOM.log”. The log shows it boots into the system. Where does that partial log you pasted come from? Does this issue happen randomly?

There is no SD card on the board.
The bad SOM can boot up after rebooting several times. The log:

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.9.140-tegra (buildbrain@mobile-u64-1935) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Mon Dec 9 22:52:02 PST 2019
[    0.000000] Boot CPU: AArch64 Processor [4e0f0040]
[    0.000000] OF: fdt:memory scan node memory, reg size 48,
[    0.000000] OF: fdt: - 80000000 ,  2c000000
[    0.000000] OF: fdt: - ac200000 ,  44600000
[    0.000000] OF: fdt: - 100000000 ,  380000000
[    0.000000] earlycon: tegra_comb_uart0 at MMIO32 0x000000000c168000 (options '')
[    0.000000] bootconsole [tegra_comb_uart0] enabled
<hit enter to activate fiq debugger>
[    1.872356] tegra-slvs-ec 15ac0000.slvs-ec: probe failed: -19
[    2.247287] tegra210-axbar tegra210-axbar: Can't retrieve parent clock
[   23.258313] INFO: rcu_preempt self-detected stall on CPU
[   23.258497] 	0-...: (1 GPs behind) idle=4a3/140000000000002/0 softirq=282/283 fqs=2503 
[   23.258652] 	 (t=5250 jiffies g=-163 c=-164 q=845)
[   48.530315] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:2:1168]
[   48.530747] Kernel panic - not syncing: softlockup: hung tasks
[   48.530868] CPU: 0 PID: 1168 Comm: kworker/0:2 Tainted: G             L  4.9.140-tegra #1
[   48.531028] Hardware name: Jetson-AGX (DT)
[   48.531114] Workqueue: events sdhci_delayed_detect
[   48.531226] Call trace:
[   48.531284] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
[   48.531401] [<ffffff800808c37c>] show_stack+0x24/0x30
[   48.531516] [<ffffff800845d820>] dump_stack+0x98/0xc0
[   48.531622] [<ffffff80081c2198>] panic+0x11c/0x298
[   48.531752] [<ffffff80081824b0>] watchdog_unpark_threads+0x0/0x98
[   48.531893] [<ffffff800813a738>] __hrtimer_run_queues+0xd8/0x360
[   48.532008] [<ffffff800813b088>] hrtimer_interrupt+0xa8/0x1e0
[   48.532136] [<ffffff8008bf52e8>] arch_timer_handler_phys+0x38/0x58
[   48.532296] [<ffffff8008127c68>] handle_percpu_devid_irq+0x90/0x2b0
[   48.532790] [<ffffff800812224c>] generic_handle_irq+0x34/0x50
[   48.533227] [<ffffff8008122930>] __handle_domain_irq+0x68/0xc0
[   48.535119] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[   48.540549] [<ffffff8008082be8>] el1_irq+0xe8/0x18c
[   48.545184] [<ffffff80080bb298>] irq_exit+0xd0/0x118
[   48.550254] [<ffffff8008122934>] __handle_domain_irq+0x6c/0xc0
[   48.555857] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[   48.561453] [<ffffff8008082be8>] el1_irq+0xe8/0x18c
[   48.566269] [<ffffff8008f59468>] _raw_spin_unlock_irqrestore+0x30/0x60
[   48.572842] [<ffffff80080eaa08>] try_to_wake_up+0x208/0x3a8
[   48.578615] [<ffffff80080eabd0>] wake_up_process+0x28/0x38
[   48.584027] [<ffffff8008125718>] __setup_irq+0x438/0x5c8
[   48.589547] [<ffffff8008125a98>] request_threaded_irq+0xf8/0x1c0
[   48.595580] [<ffffff80081286f4>] devm_request_threaded_irq+0x7c/0xe0
[   48.601969] [<ffffff8008bb392c>] mmc_gpiod_request_cd_irq+0x94/0xc8
[   48.608192] [<ffffff8008ba6350>] mmc_start_host+0x68/0xc0
[   48.613614] [<ffffff8008ba7a4c>] mmc_add_host+0x6c/0xb8
[   48.618851] [<ffffff8008bbadbc>] __sdhci_add_host+0x154/0x3f8
[   48.624452] [<ffffff8008bbce7c>] sdhci_add_host+0x2c/0x38
[   48.629878] [<ffffff8008bc27b0>] sdhci_delayed_detect+0x28/0x1f8
[   48.635841] [<ffffff80080d4f3c>] process_one_work+0x1e4/0x4b0
[   48.641614] [<ffffff80080d5258>] worker_thread+0x50/0x4c8
[   48.646943] [<ffffff80080dbee4>] kthread+0xec/0xf0
[   48.651670] [<ffffff8008083850>] ret_from_fork+0x10/0x40
[   48.657271] SMP: stopping secondary CPUs
[   48.661219] Kernel Offset: disabled
[   48.664874] Memory Limit: none
[   48.668025] trusty-log panic notifier - trusty version Built: 22:43:40 Dec  9 2019 [   48.681182] Rebooting in 5 seconds..

Even when it does boot, BadSOM.log contains some messages that differ from the good log. The main difference is:

[    2.599715] Freeing unused kernel memory: 8576K
[    2.628636] Root device found: mmcblk0p1
[    2.629651] Found dev node: /dev/mmcblk0p1
[    2.834837] tegra_cec 3960000.tegra_cec: Can't find physical addresse.
[    2.834844] tegra_cec 3960000.tegra_cec: tegra_cec_init Done.
[    4.136745] irq 250: nobody cared (try booting with the "irqpoll" option)
[    4.136930] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.140-tegra #1
[    4.136933] Hardware name: Jetson-AGX (DT)
[    4.136936] Call trace:
[    4.136954] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
[    4.136965] [<ffffff800808c37c>] show_stack+0x24/0x30
[    4.136980] [<ffffff800845d820>] dump_stack+0x98/0xc0
[    4.136992] [<ffffff80081261d4>] __report_bad_irq+0x3c/0xf8
[    4.137002] [<ffffff8008126638>] note_interrupt+0x2c8/0x318
[    4.137011] [<ffffff80081234d0>] handle_irq_event_percpu+0x50/0x60
[    4.137020] [<ffffff8008123530>] handle_irq_event+0x50/0x80
[    4.137029] [<ffffff8008127168>] handle_edge_irq+0xb0/0x178
[    4.137055] [<ffffff800812224c>] generic_handle_irq+0x34/0x50
[    4.137062] [<ffffff80084cf560>] tegra_gpio_irq_handler_desc+0x1e8/0x268
[    4.137067] [<ffffff800812224c>] generic_handle_irq+0x34/0x50
[    4.137073] [<ffffff8008122930>] __handle_domain_irq+0x68/0xc0
[    4.137078] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[    4.137082] [<ffffff8008082be8>] el1_irq+0xe8/0x18c
[    4.137113] [<ffffff8008108e34>] load_balance+0x174/0xa20
[    4.137118] [<ffffff8008109ffc>] rebalance_domains+0x1a4/0x2c8
[    4.137124] [<ffffff800810a274>] run_rebalance_domains+0x154/0x218
[    4.137146] [<ffffff8008081054>] __do_softirq+0x13c/0x3b0
[    4.137152] [<ffffff80080bb298>] irq_exit+0xd0/0x118
[    4.137156] [<ffffff8008122934>] __handle_domain_irq+0x6c/0xc0
[    4.137161] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[    4.137165] [<ffffff8008082be8>] el1_irq+0xe8/0x18c
[    4.137172] [<ffffff8008b9b330>] cpuidle_enter_state+0xb8/0x380
[    4.137177] [<ffffff8008b9b66c>] cpuidle_enter+0x34/0x48
[    4.137181] [<ffffff8008112a1c>] call_cpuidle+0x44/0x70
[    4.137185] [<ffffff8008112d98>] cpu_startup_entry+0x1b0/0x200
[    4.137191] [<ffffff8008f507b4>] rest_init+0x84/0x90
[    4.137198] [<ffffff80095e0b64>] start_kernel+0x370/0x384
[    4.137214] [<ffffff80095e0204>] __primary_switched+0x80/0x94
[    4.137216] handlers:
[    4.137268] [<ffffff80081235d0>] irq_default_primary_handler threaded [<ffffff8008bb3960>] mmc_gpio_cd_irqt
[    4.137445] Disabling IRQ #250
[    4.199566] EXT4-fs (mmcblk0p1): recovery complete
[    4.200376] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
[    4.201344] Rootfs mounted over mmcblk0p1
[    4.215608] Switching from initrd to actual rootfs
[    4.303992] systemd[1]: System time before build time, advancing clock.
[    4.323647] ip_tables: (C) 2000-2006 Netfilter Core Team
[    4.326689] cgroup: cgroup2: unknown option "nsdelegate"
[    4.334635] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
[    4.335220] systemd[1]: Detected architecture arm64.
[    4.340082] systemd[1]: Set hostname to <ubuntu>.
[    4.407010] systemd[1]: File /lib/systemd/system/systemd-journald.service:36 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
[    4.407020] systemd[1]: Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[    4.539215] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[    4.542450] systemd[1]: Created slice System Slice.
[    4.542719] systemd[1]: Listening on RPCbind Server Activation Socket.
[    4.542939] systemd[1]: Listening on /dev/initctl Compatibility Named Pipe.
[    4.543219] systemd[1]: Listening on udev Control Socket.
[    4.580017] EXT4-fs (mmcblk0p1): re-mounted. Opts: (null)
[    4.620671] nvgpu: 17000000.gv11b          nvgpu_nvhost_syncpt_init:291  [INFO]  syncpt_unit_base 60000000 syncpt_unit_size 400000 size 1000

There is no SD card on the board.

Do you mean that your board doesn’t need an SD card at all, or just that there is no SD card on it at the moment?

Our board doesn’t need an SD card, so there is no SD card circuit on the board.

Hi,

Could you disable 3400000.sdhci in your device tree?
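
For reference, disabling the controller usually comes down to overriding the node's status property. The fragment below is only a sketch: the node path sdhci@3400000 is inferred from the device name 3400000.sdhci and is an assumption, so check where the node actually sits in your board's device tree sources (it may be under a parent bus node or referenced through a label).

```dts
/* Sketch only: the node path is assumed from the device name 3400000.sdhci. */
/ {
	sdhci@3400000 {
		/* Keep the SD controller from being probed on a board with no SD card circuit. */
		status = "disabled";
	};
};
```

This lines up with the logs above: the panic trace goes through mmc_gpiod_request_cd_irq, and the "irq 250: nobody cared" report names mmc_gpio_cd_irqt as the handler, which suggests the card-detect GPIO interrupt keeps firing when nothing is wired to that pin. With the node disabled, the driver should not probe the controller, so it should not request that interrupt at all.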

Ok, I will test it.

Hi WayneWWW,

You are right, the SOM works fine when I disable 3400000.sdhci in my device tree.
Thanks a lot.