eMMC failure on multiple boards

Hello,
we have just received 4 new NVIDIA Jetson TX2 boards and are experiencing serious problems with the embedded eMMC. Specifically, after a certain amount of time, dmesg reports the following error (we experienced the problem on all of the boards):

[  309.833708] mmc0: Timeout waiting for hardware interrupt.                                       
[  309.841798] sdhci: =========== REGISTER DUMP (mmc0)===========
[  309.850284] sdhci: Sys addr: 0x00000028 | Version:  0x00000404
[  309.858747] sdhci: Blk size: 0x00007200 | Blk cnt:  0x00000000
[  309.867133] sdhci: Argument: 0x02047161 | Trn mode: 0x0000002b
[  309.875456] sdhci: Present:  0x01fb00f0 | Host ctl: 0x00000035
[  309.883728] sdhci: Power:    0x00000001 | Blk gap:  0x00000000
[  309.891990] sdhci: Wake-up:  0x00000000 | Clock:    0x00000007
[  309.900177] sdhci: Timeout:  0x0000000e | Int stat: 0x00000000
[  309.908302] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02fc000b
[  309.916414] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[  309.924469] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[  309.932484] sdhci: Cmd:      0x0000193a | Max curr: 0x00000000
[  309.940480] sdhci: Host ctl2: 0x0000300d
[  309.946499] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x0000000080000010
[  309.955129] sdhci: ===========================================
[  310.793689] mmc1: Timeout waiting for hardware interrupt.
[  310.801201] sdhci: =========== REGISTER DUMP (mmc1)===========
[  310.809110] sdhci: Sys addr: 0x00000000 | Version:  0x00000404
[  310.817021] sdhci: Blk size: 0x00007040 | Blk cnt:  0x00000000
[  310.824869] sdhci: Argument: 0xa5000040 | Trn mode: 0x00000003
[  310.832664] sdhci: Present:  0x01fb0000 | Host ctl: 0x00000017
[  310.840449] sdhci: Power:    0x00000001 | Blk gap:  0x00000000
[  310.848164] sdhci: Wake-up:  0x00000000 | Clock:    0x00000007
[  310.855814] sdhci: Timeout:  0x0000000e | Int stat: 0x00000000
[  310.863438] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02fc000b
[  310.871004] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[  310.878519] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[  310.886029] sdhci: Cmd:      0x0000353a | Max curr: 0x00000000
[  310.893479] sdhci: Host ctl2: 0x0000300b
[  310.898976] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x00000000fc200010
[  310.907104] sdhci: ===========================================
[  317.853659] INFO: rcu_preempt self-detected stall on CPU
[  317.860674]  0-...: (1 GPs behind) idle=f7b/140000000000002/0 softirq=40730/40731 fqs=5217 
[  317.872439]   (t=5250 jiffies g=19667 c=19666 q=25194)
[  317.879390] Task dump for CPU 0:
[  317.884405] hostapd         R  running task        0  2426   1924 0x0000000a
[  317.893371] Call trace:
[  317.897667] [<ffffffc000089860>] dump_backtrace+0x0/0x100
[  317.904980] [<ffffffc000089a28>] show_stack+0x14/0x1c
[  317.911935] [<ffffffc0000cefe8>] sched_show_task+0xa8/0xfc
[  317.919323] [<ffffffc0000d1314>] dump_cpu_task+0x40/0x4c
[  317.926528] [<ffffffc0000fec28>] rcu_dump_cpu_stacks+0x94/0xe4
[  317.934245] [<ffffffc000102b88>] rcu_check_callbacks+0x4fc/0xaa0
[  317.942108] [<ffffffc000107624>] update_process_times+0x3c/0x6c
[  317.949854] [<ffffffc0001165a8>] tick_sched_handle.isra.16+0x20/0x78
[  317.958019] [<ffffffc000116644>] tick_sched_timer+0x44/0x7c
[  317.965376] [<ffffffc000107d64>] __hrtimer_run_queues+0x140/0x350
[  317.973281] [<ffffffc0001087c4>] hrtimer_interrupt+0x9c/0x1e0
[  317.980852] [<ffffffc000919724>] tegra186_timer_isr+0x24/0x30
[  317.988430] [<ffffffc0000f5650>] handle_irq_event_percpu+0x84/0x290
[  317.996554] [<ffffffc0000f58a0>] handle_irq_event+0x44/0x74
[  318.003980] [<ffffffc0000f8ba8>] handle_fasteoi_irq+0xb4/0x188
[  318.011685] [<ffffffc0000f4c70>] generic_handle_irq+0x24/0x38
[  318.019294] [<ffffffc0000f4f78>] __handle_domain_irq+0x60/0xb4
[  318.026997] [<ffffffc000081774>] gic_handle_irq+0x5c/0xb4
[  318.034258] [<ffffffc000084740>] el1_irq+0x80/0xf8
[  318.040907] [<ffffffc00009e348>] ccm_encrypt+0x194/0x1f4
[  318.048145] [<ffffffbffcec4bf8>] ieee80211_aes_ccm_encrypt+0x154/0x168 [mac80211]
[  318.059391] [<ffffffbffceb5140>] ieee80211_crypto_ccmp_encrypt+0x21c/0x248 [mac80211]
[  318.071085] [<ffffffbffced3700>] invoke_tx_handlers+0xa4c/0xc90 [mac80211]
[  318.079971] [<ffffffbffced5e5c>] ieee80211_tx+0x74/0xf8 [mac80211]
[  318.088143] [<ffffffbffced5f7c>] ieee80211_xmit+0x9c/0x104 [mac80211]
[  318.096567] [<ffffffbffced69d4>] __ieee80211_subif_start_xmit+0x468/0x5bc [mac80211]
[  318.108172] [<ffffffbffced6b38>] ieee80211_subif_start_xmit+0x10/0x1c [mac80211]
[  318.119542] [<ffffffc0009c31a8>] dev_hard_start_xmit+0x234/0x45c
[  318.127650] [<ffffffc0009e51a8>] sch_direct_xmit+0xdc/0x208
[  318.135318] [<ffffffc0009c370c>] __dev_queue_xmit+0x210/0x568
[  318.143162] [<ffffffc0009c3a74>] dev_queue_xmit+0x10/0x18
[  318.150661] [<ffffffc000a54ed8>] arp_xmit+0x84/0x90
[  318.157854] [<ffffffc000a54f20>] arp_send_dst.part.13+0x3c/0x48
[  318.166092] [<ffffffc000a5581c>] arp_solicit+0xdc/0x220
[  318.173627] [<ffffffc0009cb8d8>] neigh_probe+0x54/0x88
[  318.181055] [<ffffffc0009cf6b4>] neigh_timer_handler+0xb4/0x2dc
[  318.189266] [<ffffffc000105f84>] call_timer_fn+0x50/0x1bc
[  318.196726] [<ffffffc0001062b0>] run_timer_softirq+0x1ac/0x2a4
[  318.204619] [<ffffffc0000a8cfc>] __do_softirq+0x10c/0x368
[  318.212091] [<ffffffc0000a91b0>] irq_exit+0x84/0xdc
[  318.219038] [<ffffffc0000f4f84>] __handle_domain_irq+0x6c/0xb4
[  318.226968] [<ffffffc000081774>] gic_handle_irq+0x5c/0xb4
[  318.234458] [<ffffffc000084740>] el1_irq+0x80/0xf8
[  318.241327] [<ffffffc0001080f0>] hrtimer_try_to_cancel+0xbc/0x158
[  318.249531] [<ffffffc000b597ec>] schedule_hrtimeout_range_clock+0x98/0xfc
[  318.258464] [<ffffffc000b59860>] schedule_hrtimeout_range+0x10/0x18
[  318.266824] [<ffffffc0001e889c>] poll_schedule_timeout+0x40/0x6c
[  318.274870] [<ffffffc0001e923c>] do_select+0x578/0x624
[  318.281981] [<ffffffc0001e9498>] core_sys_select+0x1b0/0x2fc
[  318.289559] [<ffffffc0001e990c>] SyS_pselect6+0x224/0x244
[  318.296817] [<ffffffc000084ff0>] el0_svc_naked+0x24/0x28
[  331.813654] dhd_bcmsdh_send_buf: sdio error -1, abort command and terminate frame.
[  331.823856] mmcblk0: timed out sending r/w cmd command, card status 0x900
[  331.832067] mmcblk0: not retrying timeout
[  331.837509] blk_update_request: I/O error, dev mmcblk0, sector 33845601
[  331.846172] blk_update_request: I/O error, dev mmcblk0, sector 33845609
[  331.854700] blk_update_request: I/O error, dev mmcblk0, sector 33845617
[  331.863202] blk_update_request: I/O error, dev mmcblk0, sector 33845625
[  331.871710] blk_update_request: I/O error, dev mmcblk0, sector 33845633
[  331.879837] Aborting journal on device mmcblk0p4-8.
[  331.883415] dhdcdc_query_ioctl: dhdcdc_msg failed w/status -5
[  331.971094] EXT4-fs error (device mmcblk0p4): ext4_journal_check_start:56: Detected aborted journal
[  331.983926] EXT4-fs (mmcblk0p4): Remounting filesystem read-only

Once this error occurs the board hangs for 10 to 20 seconds and does not reply to pings. Please note that all of our logs and data are saved on tmpfs, and we have the following partition layout on mmcblk0:

/dev/mmcblk0p1 -> U-Boot scripts for dual boot (rootfsA or rootfsB)
/dev/mmcblk0p2 -> rootfsA
/dev/mmcblk0p3 -> rootfsB
/dev/mmcblk0p4 -> data

Our kernel version is 4.4.38 and our rootfs is a minimal Ubuntu 16.04 built using debootstrap and the tools/drivers from JetPack 28.2.1.

Are you aware of similar problems on other Jetson TX2 boards? Could you please suggest how we should proceed to find the cause of the problem?

Please note that we also see similar errors on a couple of TX1 boards.

Thanks in advance for your help.

You might mention what is running at the time of the error. A timeout could be from a number of issues (including eMMC failure), but it might also be caused by the interrupt handler not being available (IRQ starvation).
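One quick way to gauge whether IRQ starvation is in play is to watch how much of CPU0's time goes to interrupt handling. A minimal sketch (the field positions follow the standard /proc/stat layout; the function name is just for illustration):

```shell
# Print CPU0's hard-IRQ and soft-IRQ jiffy counters from /proc/stat.
# Sampling this twice a few seconds apart and diffing the counters
# gives a rough picture of how much of CPU0 is spent in interrupts.
cpu0_irq_time() {
    # /proc/stat "cpu0" fields: user nice system idle iowait irq softirq ...
    awk '$1 == "cpu0" { print "irq=" $7, "softirq=" $8 }' /proc/stat
}

cpu0_irq_time
```

If the irq/softirq counters on CPU0 grow much faster than on the other cores while the workload runs, that supports the starvation theory.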

Also, if you run with “sudo nvpmodel -m 0” and “sudo ~ubuntu/jetson_clocks.sh”, does the issue go away? This maxes out performance.

It seems you’re sending encrypted data through WiFi.
I’m not sure it is related, but I see the Broadcom dongle host driver (dhd) reporting errors after that. It may just be a consequence, but to rule it out, you could try using another path (wired Ethernet).

Hello linuxdev, thanks for your prompt answer. We are a home-automation startup and are running a number of applications on our boards using containers (we are using balena https://www.balena.io/ to run containers). The most important one is a computer-vision application whose job is to recognize objects inside the house. We also have other applications to manage smart objects in the house (lights, roller shutters, etc.).

The applications do not write data to disk, and logs are stored on tmpfs. Please note that three CPU cores are idle while the CPU consumption of the active core is at most 80%.

We have just finished running some tests with the maximum performance mode set (using the nvpmodel and jetson_clocks scripts) as you suggested, and, unfortunately, the problem still occurs.

We are going to run some tests with the Broadcom WiFi card disabled, as suggested by Honey_Patouceul, as soon as possible, and will report our results back.

Thanks again for your help.

FYI, the stack traces on the other boards are different from the one we posted.

We tried disabling the Broadcom WiFi adapter and using an Ethernet connection instead, but the problem is still present.

One thing to keep in mind is that drivers are triggered to run by an IRQ. In the case of hardware I/O, a hardware IRQ is issued and the scheduler then picks a time for the driver to run.

Most desktop PCs can run all types of I/O drivers from any CPU core, but much of the TX2’s I/O can only run on CPU0. More IRQs being issued implies a greater chance that your driver will be preempted by some other driver if it requires CPU0. If nvpmodel and jetson_clocks.sh are maxed out and testing shows an improvement in I/O errors, then it may be as simple as competition for CPU0 being at fault.
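If it helps to see this on your own board, the per-IRQ CPU affinity masks can be listed from /proc (a sketch; the function name is mine, and on the TX2 you can expect most masks to read “1”, i.e. CPU0 only):

```shell
# Print each hardware IRQ number with its CPU affinity mask.
# A mask of "1" means the IRQ can only be serviced by CPU0.
list_irq_affinity() {
    for d in /proc/irq/[0-9]*; do
        [ -r "$d/smp_affinity" ] || continue
        printf '%s: %s\n' "${d##*/}" "$(cat "$d/smp_affinity")"
    done
}

list_irq_affinity
```

Cross-referencing the IRQ numbers against /proc/interrupts tells you which devices are stuck competing for CPU0.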

Does the error rate reduce when performance is maxed out?

Hello linuxdev, thanks again for your help. Unfortunately, we did not notice any change in the error rate when maximum performance is used.

FYI, it seems we have found the cause of our problem. We are currently using an Atheros ath9k card (AR9462 chipset), connected to the board through the M.2 slot, to connect to a second WiFi network. We removed the Atheros card from all the boards and the eMMC-related errors completely disappeared. We have had our boards up for more than 20 hours and no errors have occurred at all.

Could the mmc timeout be caused by a conflict between the eMMC and ath9k drivers?

The ath9k driver could be technically valid and still be a problem if it uses a lot of time on CPU0. I think you’d have to profile the ath9k driver to see if it is something simple like using too much CPU time.

One possibility is that an incorrectly configured network can lead to a storm of traffic even if you don’t see the traffic. Although this is vague, you could observe the interrupt rate from “/proc/interrupts” (e.g., save a copy each second for 60 seconds) with and without the card and see whether it looks “reasonable” (though I can’t tell you exactly what “reasonable” is).
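A minimal sketch of that sampling loop (the directory and file names are just an example):

```shell
# sample_interrupts DIR COUNT INTERVAL:
# save numbered snapshots of /proc/interrupts into DIR,
# COUNT times, INTERVAL seconds apart.
sample_interrupts() {
    dir=$1; count=$2; interval=$3
    mkdir -p "$dir"
    i=1
    while [ "$i" -le "$count" ]; do
        cp /proc/interrupts "$dir/interrupts.$i"
        i=$((i + 1))
        if [ "$i" -le "$count" ]; then sleep "$interval"; fi
    done
}

# Once per second for 60 seconds, as suggested:
# sample_interrupts /tmp/irq-samples 60 1
# Then compare the first and last snapshots to see which counters grew:
# diff /tmp/irq-samples/interrupts.1 /tmp/irq-samples/interrupts.60
```

Running this with and without the ath9k card installed, a counter that grows dramatically faster in one case would point at the noisy interrupt source.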