Debugging SD UHS cards on the TX2

I am having some trouble getting a custom carrier board to mount and write to some SD cards. The error is reproducible on a TX2 Jetson DevKit (R28.1).

The errors observed from dmesg are:

[  269.760777] mmc2 tuning done saved tap delay=32
[  269.765310] mmc2: hw tuning done ...
[  269.768885] mmc2: tuning_window[0]: 0xffffc1ff
[  269.773324] mmc2: tuning_window[1]: 0xff07ffff
[  269.777763] mmc2: tuning_window[2]: 0xfffffff
[  269.782114] mmc2: tuning_window[3]: 0x7ffffffc
[  269.786551] mmc2: tuning_window[4]: 0x0
[  269.790381] mmc2: tuning_window[5]: 0x0
[  269.794211] mmc2: tuning_window[6]: 0x0
[  269.798042] mmc2: tuning_window[7]: 0x0
[  269.801870] sdhci: Tap value: 32 | Trim value: 5
[  269.806479] sdhci: SDMMC_VENDOR_INTR_STATUS[0x108]: 0x40000
[  269.812090] mmc2: new ultra high speed SDR104 SDXC card at address aaaa
[  269.819059] mmcblk1: mmc2:aaaa SN64G 59.5 GiB 
[  269.825848] 
mmc2  sdhci_data_irq  2799   SDHCI_INT_DATA_CRC 
[  269.831503] sdhci: =========== REGISTER DUMP (mmc2)===========
[  269.837510] sdhci: Sys addr: 0x00000008 | Version:  0x00000404
[  269.843332] sdhci: Blk size: 0x00007200 | Blk cnt:  0x00000007
[  269.849154] sdhci: Argument: 0x00000000 | Trn mode: 0x0000003b
[  269.854975] sdhci: Present:  0x01fb0008 | Host ctl: 0x00000017
[  269.860796] sdhci: Power:    0x00000001 | Blk gap:  0x00000000
[  269.866617] sdhci: Wake-up:  0x00000000 | Clock:    0x00000007
[  269.872439] sdhci: Timeout:  0x0000000e | Int stat: 0x00000000
[  269.878260] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02fc000b
[  269.884081] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[  269.889902] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[  269.895722] sdhci: Cmd:      0x0000123a | Max curr: 0x00000000
[  269.901542] sdhci: Host ctl2: 0x0000308b
[  269.905457] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x0000000080000010
[  269.911968] sdhci: ===========================================
[  269.925840] mmcblk1: error -110 sending stop command, original cmd response 0x900, card status 0x400900
[  269.935298] mmcblk1: retrying because a re-tune was needed
[  269.940834] sdhci-tegra 3400000.sdhci: Tuning already done, restoring the best tap value : 32
[  269.949743] 
mmc2  sdhci_data_irq  2799   SDHCI_INT_DATA_CRC 
[  269.955393] sdhci: =========== REGISTER DUMP (mmc2)===========
[  269.961399] sdhci: Sys addr: 0x00000008 | Version:  0x00000404
[  269.967220] sdhci: Blk size: 0x00007200 | Blk cnt:  0x00000007
[  269.973042] sdhci: Argument: 0x00000000 | Trn mode: 0x0000003b
[  269.978863] sdhci: Present:  0x01fb0008 | Host ctl: 0x00000017
[  269.984685] sdhci: Power:    0x00000001 | Blk gap:  0x00000000
[  269.990506] sdhci: Wake-up:  0x00000000 | Clock:    0x00000007
[  269.996327] sdhci: Timeout:  0x0000000e | Int stat: 0x00000000
[  270.002147] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02fc000b
[  270.007968] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[  270.013788] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[  270.019609] sdhci: Cmd:      0x0000123a | Max curr: 0x00000000
[  270.025428] sdhci: Host ctl2: 0x0000308b
[  270.029344] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x0000000080000010
[  270.035855] sdhci: ===========================================
[  270.049770] mmcblk1: error -110 sending stop command, original cmd response 0x900, card status 0x400900
[  270.059169] mmcblk1: error -84 transferring data, sector 0, nr 8, cmd response 0x900, card status 0x0
[  270.068418] sdhci-tegra 3400000.sdhci: Tuning already done, restoring the best tap value : 32
[  270.146188] mmc2: tried to reset card
[  270.151288]  mmcblk1: p1
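
One way to read the tuning_window lines in the log above is as a bitmap of tap delays. The sketch below assumes (nothing in this thread confirms it) that bit k of tuning_window[i] stands for tap 32*i + k and that a set bit marks a tap that passed tuning. The function names are made up for illustration.

```python
# Sketch: decode the tuning_window[] bitmaps printed by the driver.
# ASSUMPTION (not confirmed by driver source in this thread): bit k of
# tuning_window[i] is tap 32*i + k, and a set bit means that tap passed.


def passing_runs(windows):
    """Return inclusive (start, end) tap ranges where consecutive bits are set."""
    runs = []
    start = None
    total_taps = 32 * len(windows)
    for tap in range(total_taps):
        passed = (windows[tap // 32] >> (tap % 32)) & 1
        if passed and start is None:
            start = tap
        elif not passed and start is not None:
            runs.append((start, tap - 1))
            start = None
    if start is not None:
        runs.append((start, total_taps - 1))
    return runs


def widest_run(windows):
    """Return (start, end, midpoint) of the widest passing run, or None."""
    runs = passing_runs(windows)
    if not runs:
        return None
    start, end = max(runs, key=lambda r: r[1] - r[0])
    return start, end, (start + end) // 2


# Values copied from the dmesg excerpt above (saved tap delay = 32).
first_log = [0xffffc1ff, 0xff07ffff, 0x0fffffff, 0x7ffffffc, 0, 0, 0, 0]
print(passing_runs(first_log))
print(widest_run(first_log))
```

Under that assumption, the widest passing run in this log is taps 14 to 50, and its midpoint is 32, which agrees with the "saved tap delay=32" line. Several separate failing gaps inside the window (here taps 9-13, 51-55, and 92-97) may hint at marginal signal quality, but that is an interpretation, not something the log states.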

The card I’m having trouble with is the SanDisk Extreme A2 microSD: https://www.sandisk.com/home/memory-cards/microsd-cards/extreme-microsd-a2

A slightly different model SD card works flawlessly: SanDisk Extreme A1 microSD, https://www.sandisk.com/home/memory-cards/microsd-cards/extreme-microsd

Both cards are the 64 GB model. The same problem was also observed on some ADATA cards. The SD cards tested appear to work in other systems without any issue.

I’m running a kernel based on the L4T 28.1 release, with the patch from https://devtalk.nvidia.com/default/topic/1031139/tx2-sd-card-driver-bug-sd-does-not-enumerate-in-uhs-mode/

The same error is reproducible on the Jetson DevKit (L4T 28.1), but only with an SD card extender connected to the DevKit. The SD-to-SD extender used with the DevKit is similar to https://www.amazon.ca/Extension-Adapter-Flexible-Extender-RS-MMC/dp/B07BZGNRP7/.

My questions:

  1. Is there any advice on how to debug this problem?

  2. What is the recommended method to qualify the SD card interface with the TX2? Are there any options available in software?

  3. Are there available APIs to tune the SD card interface, similar to the ones available for the USB interface?

I would suggest moving to the latest release (rel-28.2.1) and running the test again. If the error persists, please let me know.

Having the same errors (roughly) with the following setup:

Freshly flashed TX2 (Jetpack 3.3, L4T 28.2.1)
Our card is this: SanDisk Ultra 64GB microSDXC UHS-I card - SDSQUAR-064G-GN6MA
Card has a GPT partition table with one partition, formatted as ext4

Applied patches from https://devtalk.nvidia.com/default/topic/1031139/jetson-tx2/tx2-sd-card-driver-bug-sd-does-not-enumerate-in-uhs-mode/post/5250586/#5250586

The card appears in lsblk, but it seems to take too long to attach properly to the TX2 because our line in /etc/fstab does not mount the drive on boot (subsequent sudo mount -a works).
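
For the late-attach case only, one possible stopgap is to wait for the partition node to appear and then mount. The sketch below is an assumption-laden illustration: the helper name and the device path /dev/mmcblk1p1 are hypothetical, and it does nothing about the underlying I/O errors.

```python
import os
import stat
import time


def wait_for_block_device(path, timeout_s=30.0, poll_s=0.5):
    """Poll until `path` exists and is a block device; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if stat.S_ISBLK(os.stat(path).st_mode):
                return True
        except OSError:
            # Node not created yet (or not accessible); keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Hypothetical usage from a boot-time script run as root:
#   import subprocess
#   if wait_for_block_device("/dev/mmcblk1p1"):   # assumed device path
#       subprocess.run(["mount", "-a"], check=False)
```

Polling with a deadline keeps boot from hanging forever when the card never enumerates, which is also one of the failure modes described here.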

Some dmesg output:

[    7.133311] mmc2: tried to reset card
[    7.134655]
               mmc2  sdhci_data_irq  2792   SDHCI_INT_DATA_END_BIT
[    7.134655] sdhci: =========== REGISTER DUMP (mmc2)===========
[    7.134659] sdhci: Sys addr: 0x00000000 | Version:  0x00000404
[    7.134662] sdhci: Blk size: 0x00007200 | Blk cnt:  0x00000000
[    7.134665] sdhci: Argument: 0x0000080b | Trn mode: 0x00000013
[    7.134668] sdhci: Present:  0x01fb0000 | Host ctl: 0x00000016
[    7.134671] sdhci: Power:    0x00000001 | Blk gap:  0x00000000
[    7.134674] sdhci: Wake-up:  0x00000000 | Clock:    0x00000007
[    7.134677] sdhci: Timeout:  0x0000000e | Int stat: 0x00000000
[    7.134680] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02fc000b
[    7.134683] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[    7.134686] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[    7.134689] sdhci: Cmd:      0x0000113a | Max curr: 0x00000000
[    7.134691] sdhci: Host ctl2: 0x0000300b
[    7.134695] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x0000000080000010
[    7.134695] sdhci: ===========================================
[    7.134761] mmcblk1: error -84 transferring data, sector 2059, nr 5, cmd response 0x900, card status 0x0
[    7.134773] blk_update_request: I/O error, dev mmcblk1, sector 2059
[    7.134788] sdhci-tegra 3400000.sdhci: Tuning already done, restoring the best tap value : 93

Update:

Whether it mounts seems to be somewhat non-deterministic. Occasionally I’m seeing it mount properly and dmesg looks like this:

[    2.900909] mmc2: SDHCI controller on 3400000.sdhci [3400000.sdhci] using ADMA 64-bit with 64 bit addr
[    3.485388] mmc2 tuning done saved tap delay=49
[    3.485392] mmc2: hw tuning done ...
[    3.485400] mmc2: tuning_window[0]: 0xffeff
[    3.485437] mmc2: tuning_window[1]: 0xfffffff8
[    3.485443] mmc2: tuning_window[2]: 0xeeff8000
[    3.485450] mmc2: tuning_window[3]: 0x78001fff
[    3.485476] mmc2: tuning_window[4]: 0x0
[    3.485482] mmc2: tuning_window[5]: 0x0
[    3.485489] mmc2: tuning_window[6]: 0x0
[    3.485494] mmc2: tuning_window[7]: 0x0
[    3.485540] mmc2: new ultra high speed SDR104 SDXC card at address aaaa
[    3.486081] mmcblk1: mmc2:aaaa SC64G 59.5 GiB
               mmc2  sdhci_data_irq  2792   SDHCI_INT_DATA_END_BIT
[    6.718628] sdhci: =========== REGISTER DUMP (mmc2)===========
               mmc2  sdhci_data_irq  2792   SDHCI_INT_DATA_END_BIT
[    6.721112] sdhci: =========== REGISTER DUMP (mmc2)===========
[    6.944749] mmc2: tried to reset card
               mmc2  sdhci_data_irq  2792   SDHCI_INT_DATA_END_BIT
[    6.946122] sdhci: =========== REGISTER DUMP (mmc2)===========
               mmc2  sdhci_data_irq  2792   SDHCI_INT_DATA_END_BIT
[    6.947222] sdhci: =========== REGISTER DUMP (mmc2)===========
[    7.176986] mmc2: tried to reset card
               mmc2  sdhci_data_irq  2792   SDHCI_INT_DATA_END_BIT
[    7.178310] sdhci: =========== REGISTER DUMP (mmc2)===========
               mmc2  sdhci_data_irq  2792   SDHCI_INT_DATA_END_BIT
[    7.178691] sdhci: =========== REGISTER DUMP (mmc2)===========
               mmc2  sdhci_data_irq  2792   SDHCI_INT_DATA_END_BIT
[    7.178976] sdhci: =========== REGISTER DUMP (mmc2)===========
               mmc2  sdhci_data_irq  2792   SDHCI_INT_DATA_END_BIT
...

The exact same SD card seems to work flawlessly on RT 28.2.0 on TX1

Could you reproduce the issue on a TX2 running rel-28.1? Just trying to look for a clue…

Unfortunately this is not our top priority at the moment and I’m a bit too busy to give much time to this. Is there anything you guys can look into internally? It would be extremely helpful.

If we don’t have that card, we may not be able to debug this. I’ll ask the internal team to help check.
Please keep following up on this thread when you are available.

Could you also share the full dmesg output, including the error logs?

Here are more dmesg logs (pretty much the full output, I believe). I’m not sure what additional error logs you want, but I’ve included some relevant lines from syslog.

With respect to replicating the issue, everything I’ve seen online about it seems to come down to the card being UHS (specifically, I’ve seen people reporting trouble with UHS-I, but I’d guess any UHS card will produce the problem). The SDHCI driver/module seems to recognize the device without issue, but repeatedly fails to “tune the hardware.” As I’ve mentioned previously, the behavior seems fairly non-deterministic. Sometimes the card mounts with no problem (fstab and all), sometimes it comes up (i.e. in lsblk) but seemingly too late for fstab, and sometimes it never shows up at all, as if it weren’t plugged in.

As an aside, we’re using the TX2 with a custom PCB. We’ve used TX1 with this PCB previously and had no issues with SD cards, which makes me think the PCB is not to blame (we had device tree problems initially, but solved those).

Syslog:

Jan 28 14:47:45 new_unit_placeholer kernel: [    3.599472] mmc2: tuning execution failed
Jan 28 14:47:45 new_unit_placeholer kernel: [    3.599482] mmc2: error -5 whilst initialising SD card
Jan 28 14:47:45 new_unit_placeholer kernel: [    3.971597] sdhci: Tuning procedure failed, falling back to fixed sampling clock
Jan 28 14:47:45 new_unit_placeholer kernel: [    3.971603] mmc2: tuning execution failed
Jan 28 14:47:45 new_unit_placeholer kernel: [    3.971615] mmc2: error -5 whilst initialising SD card
Jan 28 14:47:45 new_unit_placeholer kernel: [    4.344619] sdhci: Tuning procedure failed, falling back to fixed sampling clock
Jan 28 14:47:45 new_unit_placeholer kernel: [    4.344626] mmc2: tuning execution failed
Jan 28 14:47:45 new_unit_placeholer kernel: [    4.344637] mmc2: error -5 whilst initialising SD card
Jan 28 14:47:45 new_unit_placeholer kernel: [    4.623480] mmc2: error -110 whilst initialising SD card
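
The negative numbers in these messages are Linux errno values. The sketch below maps the three codes that appear in this thread to their names; the table is hard-coded (generic Linux numbering) so it gives the same answer on any host, and the helper name is made up for illustration.

```python
import re

# Linux errno values (asm-generic numbering), limited to codes seen in this
# thread. The MMC core uses EILSEQ for CRC and end-bit errors on the bus.
KERNEL_ERRNO = {
    5: ("EIO", "I/O error"),
    84: ("EILSEQ", "bad data on the bus, e.g. CRC check failed or bad end bit"),
    110: ("ETIMEDOUT", "card took too long to respond"),
}

ERROR_RE = re.compile(r"error (-\d+)")


def explain_errors(line):
    """Return a description for each 'error -N' found in one log line."""
    found = []
    for match in ERROR_RE.finditer(line):
        code = -int(match.group(1))
        name, description = KERNEL_ERRNO.get(code, ("unknown", "not in this table"))
        found.append(f"-{code} {name}: {description}")
    return found


print(explain_errors("mmc2: error -5 whilst initialising SD card"))
print(explain_errors("mmcblk1: error -84 transferring data, sector 0, nr 8"))
print(explain_errors("mmc2: error -110 whilst initialising SD card"))
```

Read this way, the -84 lines are consistent with the SDHCI_INT_DATA_CRC and SDHCI_INT_DATA_END_BIT interrupts in the dumps: the data arrived, but corrupted, rather than not arriving at all.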

Dmesg up until boot is complete: https://gist.github.com/jlucier/8b560c7335d42f2d60a0f4b50def7281

Could you also reproduce this issue on the devkit?
As you know, the patch from topic 1031139 was already a fix for UHS mode, so I don’t think we would hit this issue with an arbitrary card.

Anyway, we will run the test with all the UHS cards we have. Thanks.

I actually cannot reproduce this on our devkit. It seems that this only happens on our custom board, which is strange because as I mentioned we didn’t have trouble with TX1 on this board.

To test our card with the devkit I needed to use a microSD to full-size SD card adapter. Not sure if that would make any difference.

Do you have any idea as to what the issue could be now? Seems like it may not be drivers or the card, so does that leave us with a device tree problem? Something strange with our board + TX2? Are there differences between TX1 and TX2 SD card pinouts?

Also, thanks for all the follow up thus far.

Apparently our board was designed to support up to 1.44 W in SDR50 or DDR50 mode, but a UHS-I card can consume up to 2.88 W in SDR104 mode.
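
As a sanity check on those two figures: they line up with SD card current limits of 400 mA and 800 mA evaluated at a 3.6 V supply. The sketch below just does that arithmetic; the 200/400/600/800 mA steps and the 3.6 V figure are my recollection of the SD specification and should be treated as assumptions, not as values from this thread.

```python
# ASSUMPTION: SD current-limit steps and the 3.6 V upper supply bound are
# recalled from the SD Physical Layer spec, not taken from this thread.
SUPPLY_MAX_V = 3.6
CURRENT_LIMITS_MA = (200, 400, 600, 800)


def power_limit_w(current_ma, supply_v=SUPPLY_MAX_V):
    """Worst-case power in watts for a given current limit."""
    return round(current_ma / 1000 * supply_v, 2)


limits = [power_limit_w(ma) for ma in CURRENT_LIMITS_MA]
print(limits)

# Headroom of a supply rail sized for 1.44 W against the 800 mA class.
board_budget_w = 1.44
print(round(power_limit_w(800) - board_budget_w, 2))
```

If the assumption holds, a rail sized for the 400 mA class would be short by 1.44 W in the worst case for a card that negotiates the 800 mA class.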

Does TX1 not support SDR104? Could that explain the difference in behavior?

Is there a workaround to force these cards to only enter SDR50 mode? Or would we require device tree changes or driver patches?

Please try this patch.

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index ff48515..956cb9d 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -1857,12 +1857,11 @@
 	/* Keep clock gated for at least 10 ms, though spec only says 5 ms */
 	mmc_delay(10);
 	host->ios.clock = clock;
-	host->skip_host_clkgate = false;
 	mmc_set_ios(host);
 
 	/* Wait for at least 1 ms according to spec */
 	mmc_delay(1);
-
+	host->skip_host_clkgate = false;
 	/*
 	 * Failure to switch is indicated by the card holding
 	 * dat[0:3] low

Sorry for the long delay, but I just got to this again.

As far as I can tell, that patch is working! I am able to get the card to show up and mount every time (lsblk, fstab, etc.).

Moving forward, will this patch get integrated into a future release? Or will we need to patch each time?

We have integrated the patch into our codebase. It should be in the next release.

I’ve been attempting to roll out the patch to our fleet of devices, and my confidence seems to have been a bit premature. On the device I patched in house, the problem seemed totally fixed (the SD card came up every time across very many reboots). However, I’m not getting the same results now that I’ve patched more devices and seen more reboots. I’m seeing the output below repeatedly in dmesg on patched devices, along with the same non-deterministic behavior of the card mounting or appearing in lsblk. It might be happening less often now, but any occurrence at all is not workable for us. Please advise.

[    7.891915]
               mmc2  sdhci_data_irq  2781   SDHCI_INT_DATA_END_BIT
[    7.891915] sdhci: =========== REGISTER DUMP (mmc2)===========
[    7.891919] sdhci: Sys addr: 0x00000008 | Version:  0x00000404
[    7.891922] sdhci: Blk size: 0x00007200 | Blk cnt:  0x00000000
[    7.891925] sdhci: Argument: 0x076f4ff8 | Trn mode: 0x0000003b
[    7.891928] sdhci: Present:  0x01fb0000 | Host ctl: 0x00000016
[    7.891931] sdhci: Power:    0x00000001 | Blk gap:  0x00000000
[    7.891934] sdhci: Wake-up:  0x00000000 | Clock:    0x00000007
[    7.891936] sdhci: Timeout:  0x0000000e | Int stat: 0x00000000
[    7.891939] sdhci: Int enab: 0x02ff000b | Sig enab: 0x02fc000b
[    7.891942] sdhci: AC12 err: 0x00000000 | Slot int: 0x00000000
[    7.891945] sdhci: Caps:     0x3f6cd08c | Caps_1:   0x18006f73
[    7.891948] sdhci: Cmd:      0x0000123a | Max curr: 0x00000000
[    7.891949] sdhci: Host ctl2: 0x0000300b
[    7.891953] sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x0000000080000010
[    7.891954] sdhci: ===========================================
[    7.892017] mmcblk1: error -110 sending stop command, original cmd response 0x900, card status 0x400900
[    7.892018] mmcblk1: retrying because a re-tune was needed
[    7.892037] sdhci-tegra 3400000.sdhci: Tuning already done, restoring the best tap value : 51

Occasionally with these additional lines sprinkled in:

[    7.784438] blk_update_request: I/O error, dev mmcblk1, sector 2063
[    7.784442] Buffer I/O error on dev mmcblk1p1, logical block 1, async page read
[    7.892409] mmcblk1: error -84 transferring data, sector 124735480, nr 8, cmd response 0x900, card status 0x0

And when it doesn’t come up I see this:

[   10.978064] sdhci-tegra 3400000.sdhci: Tuning already done, restoring the best tap value : 101
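
Since the behavior differs from boot to boot and device to device, it may help to reduce each dmesg capture to a few numbers that can be compared across the fleet. The sketch below is one hypothetical way to do that; the function name and the choice of fields are mine, and the regular expressions are written against the message formats quoted in this thread only.

```python
import re
from collections import Counter

# Patterns written against the log lines quoted in this thread.
IRQ_RE = re.compile(r"sdhci_data_irq\s+\d+\s+(SDHCI_INT_\w+)")
TAP_RE = re.compile(r"restoring the best tap value : (\d+)")
XFER_RE = re.compile(r"error (-\d+) transferring data, sector (\d+)")


def summarize(dmesg_text):
    """Count data-IRQ errors and collect restored taps and failed sectors."""
    return {
        "irq_counts": dict(Counter(IRQ_RE.findall(dmesg_text))),
        "restored_taps": [int(t) for t in TAP_RE.findall(dmesg_text)],
        "failed_sectors": [int(s) for _, s in XFER_RE.findall(dmesg_text)],
    }


# Sample built from lines quoted above.
sample = """\
               mmc2  sdhci_data_irq  2781   SDHCI_INT_DATA_END_BIT
[    7.892037] sdhci-tegra 3400000.sdhci: Tuning already done, restoring the best tap value : 51
[    7.892409] mmcblk1: error -84 transferring data, sector 124735480, nr 8, cmd response 0x900, card status 0x0
[   10.978064] sdhci-tegra 3400000.sdhci: Tuning already done, restoring the best tap value : 101
"""
print(summarize(sample))
```

The restored tap values quoted across this thread vary widely (32, 93, 51, 101), so logging them per boot alongside whether the card mounted might show whether certain tap values go with failures. That correlation is a guess to be tested, not a finding.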