Hello,
we have just received four new NVIDIA Jetson TX2 boards and are experiencing serious problems with the embedded eMMC. Specifically, after a certain amount of time, dmesg reports the following error (we have seen the problem on all four boards):
[ 309.833708] mmc0: Timeout waiting for hardware interrupt.
[ 309.841798] sdhci: =========== REGISTER DUMP (mmc0)===========
[ 309.850284] sdhci: Sys addr: 0x00000028 | Version: 0x00000404
[ 309.858747] sdhci: Blk size: 0x00007200 | Blk cnt: 0x00000000
[ 309.867133] sdhci: Argument: 0x02047161 | Trn mode: 0x0000002b
[ 309.875456] sdhci: Present: 0x01fb00f0 | Host ctl: 0x00000035
[ 309.883728] sdhci: Power: 0x00000001 | Blk gap: 0x00000000
[ 309.891990] sdhci: Wake-up: 0x00000000 | Clock: 0x00000007
[ 309.900177] sdhci: Timeout: 0x0000000e | Int stat: 0x00000000
[ 309.908302] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02fc000b
[ 309.916414] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[ 309.924469] sdhci: Caps: 0x3f6cd08c | Caps_1: 0x18006f73
[ 309.932484] sdhci: Cmd: 0x0000193a | Max curr: 0x00000000
[ 309.940480] sdhci: Host ctl2: 0x0000300d
[ 309.946499] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x0000000080000010
[ 309.955129] sdhci: ===========================================
[ 310.793689] mmc1: Timeout waiting for hardware interrupt.
[ 310.801201] sdhci: =========== REGISTER DUMP (mmc1)===========
[ 310.809110] sdhci: Sys addr: 0x00000000 | Version: 0x00000404
[ 310.817021] sdhci: Blk size: 0x00007040 | Blk cnt: 0x00000000
[ 310.824869] sdhci: Argument: 0xa5000040 | Trn mode: 0x00000003
[ 310.832664] sdhci: Present: 0x01fb0000 | Host ctl: 0x00000017
[ 310.840449] sdhci: Power: 0x00000001 | Blk gap: 0x00000000
[ 310.848164] sdhci: Wake-up: 0x00000000 | Clock: 0x00000007
[ 310.855814] sdhci: Timeout: 0x0000000e | Int stat: 0x00000000
[ 310.863438] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02fc000b
[ 310.871004] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[ 310.878519] sdhci: Caps: 0x3f6cd08c | Caps_1: 0x18006f73
[ 310.886029] sdhci: Cmd: 0x0000353a | Max curr: 0x00000000
[ 310.893479] sdhci: Host ctl2: 0x0000300b
[ 310.898976] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x00000000fc200010
[ 310.907104] sdhci: ===========================================
[ 317.853659] INFO: rcu_preempt self-detected stall on CPU
[ 317.860674] 0-...: (1 GPs behind) idle=f7b/140000000000002/0 softirq=40730/40731 fqs=5217
[ 317.872439] (t=5250 jiffies g=19667 c=19666 q=25194)
[ 317.879390] Task dump for CPU 0:
[ 317.884405] hostapd R running task 0 2426 1924 0x0000000a
[ 317.893371] Call trace:
[ 317.897667] [<ffffffc000089860>] dump_backtrace+0x0/0x100
[ 317.904980] [<ffffffc000089a28>] show_stack+0x14/0x1c
[ 317.911935] [<ffffffc0000cefe8>] sched_show_task+0xa8/0xfc
[ 317.919323] [<ffffffc0000d1314>] dump_cpu_task+0x40/0x4c
[ 317.926528] [<ffffffc0000fec28>] rcu_dump_cpu_stacks+0x94/0xe4
[ 317.934245] [<ffffffc000102b88>] rcu_check_callbacks+0x4fc/0xaa0
[ 317.942108] [<ffffffc000107624>] update_process_times+0x3c/0x6c
[ 317.949854] [<ffffffc0001165a8>] tick_sched_handle.isra.16+0x20/0x78
[ 317.958019] [<ffffffc000116644>] tick_sched_timer+0x44/0x7c
[ 317.965376] [<ffffffc000107d64>] __hrtimer_run_queues+0x140/0x350
[ 317.973281] [<ffffffc0001087c4>] hrtimer_interrupt+0x9c/0x1e0
[ 317.980852] [<ffffffc000919724>] tegra186_timer_isr+0x24/0x30
[ 317.988430] [<ffffffc0000f5650>] handle_irq_event_percpu+0x84/0x290
[ 317.996554] [<ffffffc0000f58a0>] handle_irq_event+0x44/0x74
[ 318.003980] [<ffffffc0000f8ba8>] handle_fasteoi_irq+0xb4/0x188
[ 318.011685] [<ffffffc0000f4c70>] generic_handle_irq+0x24/0x38
[ 318.019294] [<ffffffc0000f4f78>] __handle_domain_irq+0x60/0xb4
[ 318.026997] [<ffffffc000081774>] gic_handle_irq+0x5c/0xb4
[ 318.034258] [<ffffffc000084740>] el1_irq+0x80/0xf8
[ 318.040907] [<ffffffc00009e348>] ccm_encrypt+0x194/0x1f4
[ 318.048145] [<ffffffbffcec4bf8>] ieee80211_aes_ccm_encrypt+0x154/0x168 [mac80211]
[ 318.059391] [<ffffffbffceb5140>] ieee80211_crypto_ccmp_encrypt+0x21c/0x248 [mac80211]
[ 318.071085] [<ffffffbffced3700>] invoke_tx_handlers+0xa4c/0xc90 [mac80211]
[ 318.079971] [<ffffffbffced5e5c>] ieee80211_tx+0x74/0xf8 [mac80211]
[ 318.088143] [<ffffffbffced5f7c>] ieee80211_xmit+0x9c/0x104 [mac80211]
[ 318.096567] [<ffffffbffced69d4>] __ieee80211_subif_start_xmit+0x468/0x5bc [mac80211]
[ 318.108172] [<ffffffbffced6b38>] ieee80211_subif_start_xmit+0x10/0x1c [mac80211]
[ 318.119542] [<ffffffc0009c31a8>] dev_hard_start_xmit+0x234/0x45c
[ 318.127650] [<ffffffc0009e51a8>] sch_direct_xmit+0xdc/0x208
[ 318.135318] [<ffffffc0009c370c>] __dev_queue_xmit+0x210/0x568
[ 318.143162] [<ffffffc0009c3a74>] dev_queue_xmit+0x10/0x18
[ 318.150661] [<ffffffc000a54ed8>] arp_xmit+0x84/0x90
[ 318.157854] [<ffffffc000a54f20>] arp_send_dst.part.13+0x3c/0x48
[ 318.166092] [<ffffffc000a5581c>] arp_solicit+0xdc/0x220
[ 318.173627] [<ffffffc0009cb8d8>] neigh_probe+0x54/0x88
[ 318.181055] [<ffffffc0009cf6b4>] neigh_timer_handler+0xb4/0x2dc
[ 318.189266] [<ffffffc000105f84>] call_timer_fn+0x50/0x1bc
[ 318.196726] [<ffffffc0001062b0>] run_timer_softirq+0x1ac/0x2a4
[ 318.204619] [<ffffffc0000a8cfc>] __do_softirq+0x10c/0x368
[ 318.212091] [<ffffffc0000a91b0>] irq_exit+0x84/0xdc
[ 318.219038] [<ffffffc0000f4f84>] __handle_domain_irq+0x6c/0xb4
[ 318.226968] [<ffffffc000081774>] gic_handle_irq+0x5c/0xb4
[ 318.234458] [<ffffffc000084740>] el1_irq+0x80/0xf8
[ 318.241327] [<ffffffc0001080f0>] hrtimer_try_to_cancel+0xbc/0x158
[ 318.249531] [<ffffffc000b597ec>] schedule_hrtimeout_range_clock+0x98/0xfc
[ 318.258464] [<ffffffc000b59860>] schedule_hrtimeout_range+0x10/0x18
[ 318.266824] [<ffffffc0001e889c>] poll_schedule_timeout+0x40/0x6c
[ 318.274870] [<ffffffc0001e923c>] do_select+0x578/0x624
[ 318.281981] [<ffffffc0001e9498>] core_sys_select+0x1b0/0x2fc
[ 318.289559] [<ffffffc0001e990c>] SyS_pselect6+0x224/0x244
[ 318.296817] [<ffffffc000084ff0>] el0_svc_naked+0x24/0x28
[ 331.813654] dhd_bcmsdh_send_buf: sdio error -1, abort command and terminate frame.
[ 331.823856] mmcblk0: timed out sending r/w cmd command, card status 0x900
[ 331.832067] mmcblk0: not retrying timeout
[ 331.837509] blk_update_request: I/O error, dev mmcblk0, sector 33845601
[ 331.846172] blk_update_request: I/O error, dev mmcblk0, sector 33845609
[ 331.854700] blk_update_request: I/O error, dev mmcblk0, sector 33845617
[ 331.863202] blk_update_request: I/O error, dev mmcblk0, sector 33845625
[ 331.871710] blk_update_request: I/O error, dev mmcblk0, sector 33845633
[ 331.879837] Aborting journal on device mmcblk0p4-8.
[ 331.883415] dhdcdc_query_ioctl: dhdcdc_msg failed w/status -5
[ 331.971094] EXT4-fs error (device mmcblk0p4): ext4_journal_check_start:56: Detected aborted journal
[ 331.983926] EXT4-fs (mmcblk0p4): Remounting filesystem read-only
Once this error occurs, the board hangs for 10 to 20 seconds and does not reply to pings. Please note that we keep all logs and data on tmpfs, and that mmcblk0 has the following partition layout:
/dev/mmcblk0p1 → U-Boot scripts for dual boot (rootfsA or rootfsB)
/dev/mmcblk0p2 → rootfsA
/dev/mmcblk0p3 → rootfsB
/dev/mmcblk0p4 → data
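To compare the failures across boards, the dmesg output above could be summarized with a small script along these lines. This is only a sketch: the function name summarize_dmesg and the keys of the returned dictionary are our own illustrative choices, and the patterns assume the exact message wording shown in the log above.

```python
import re

# Patterns assume the kernel message wording seen in the dmesg log above.
TIMEOUT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(mmc\d+): Timeout waiting for hardware interrupt"
)
IO_ERROR_RE = re.compile(
    r"blk_update_request: I/O error, dev (\w+), sector (\d+)"
)
READONLY_RE = re.compile(
    r"EXT4-fs \((\w+)\): Remounting filesystem read-only"
)


def summarize_dmesg(text):
    """Summarize MMC timeouts, block I/O errors and read-only remounts.

    `text` is the full dmesg output as a single string.
    """
    timeouts = {}          # host name (e.g. "mmc0") -> number of timeouts
    first_timeout = None   # kernel timestamp (seconds) of the first timeout
    io_error_sectors = {}  # block device -> list of failing sectors
    readonly_remounts = [] # partitions remounted read-only by ext4

    for line in text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            host = match.group(2)
            timeouts[host] = timeouts.get(host, 0) + 1
            if first_timeout is None:
                first_timeout = float(match.group(1))
            continue

        match = IO_ERROR_RE.search(line)
        if match:
            device = match.group(1)
            io_error_sectors.setdefault(device, []).append(int(match.group(2)))
            continue

        match = READONLY_RE.search(line)
        if match:
            readonly_remounts.append(match.group(1))

    return {
        "timeouts": timeouts,
        "first_timeout": first_timeout,
        "io_error_sectors": io_error_sectors,
        "readonly_remounts": readonly_remounts,
    }
```

The function would be called on the saved output of dmesg from each board, for example summarize_dmesg(open("dmesg.txt").read()), where dmesg.txt is a hypothetical file name.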
Our kernel version is 4.4.38, and our rootfs is a minimal Ubuntu 16.04 built with debootstrap plus the tools and drivers from JetPack (L4T 28.2.1).
Are you aware of similar problems on other Jetson TX2 boards? Could you please suggest how we should proceed to find the cause of the problem?
Please note that we also see similar errors on a couple of TX1 boards.
Thanks in advance for your help.