SD card not accessible when SOM gets warm

Hi,

We have developed a custom base board that takes a Nano or Xavier NX SOM.
This board has an SD card slot on sdhci3.
The sdhci interface is initialised as mmc1, and when a card is inserted it is properly detected as mmcblk1.
In this state we can use the card as expected.
But after some time, any access to the card results in errors.
Ejecting the card and plugging it in again shows the error “mmc1: error -110 whilst initialising SD card” in dmesg.
The error persists even after rebooting.

The only way to get the SD card working again is to power off the device and wait for some time.
When we then power the device on, the card is usable again for a limited time until the errors return.

Because of this behaviour we suspect a thermal problem.
When we cool the SOM and keep the temperature at around 30°C, we can use the SD card for hours.
Without cooling, the SOM goes above 50°C and the problem occurs.

We have attached a dmesg log in which we boot the module and run a script that periodically mounts and unmounts the SD card until the error occurs.

Does NVIDIA have any hint as to why we get communication problems when the temperature rises?
Is anything done in software at specific temperature levels?

Thanks in advance

Daniel
dmesg-sd-card.log (108.2 KB)

Did you apply all of these patches when you brought up this card slot?

Hi WayneWWW,

We applied all changes to the Nano, but the error still persists.
When we change the test script to test only writes, we get the following errors after some time:

[  488.798065] mmcblk1: error -110 requesting status
[  489.378613] mmc1: tried to reset card, got error -110
[  489.384033] blk_update_request: I/O error, dev mmcblk1, sector 8193
[  489.390533] Buffer I/O error on dev mmcblk1p1, logical block 1, lost async page write
[  489.400480] mmcblk1: error -110 sending status command, retrying
[  489.407670] mmcblk1: error -110 sending status command, retrying
[  489.414557] mmcblk1: error -110 sending status command, aborting
[  489.420764] blk_update_request: I/O error, dev mmcblk1, sector 9650
[  489.427430] Buffer I/O error on dev mmcblk1p1, logical block 1458, lost async page write
[  489.438013] mmcblk1: error -110 sending status command, retrying
[  489.445124] mmcblk1: error -110 sending status command, retrying
[  489.452530] mmcblk1: error -110 sending status command, aborting
[  489.459170] blk_update_request: I/O error, dev mmcblk1, sector 13350
[  489.465950] Buffer I/O error on dev mmcblk1p1, logical block 5158, lost async page write
[  489.475997] mmcblk1: error -110 sending status command, retrying
[  489.482967] mmcblk1: error -110 sending status command, retrying
[  489.489558] mmcblk1: error -110 sending status command, aborting
[  489.496078] blk_update_request: I/O error, dev mmcblk1, sector 16384
[  489.502936] Buffer I/O error on dev mmcblk1p1, logical block 8192, lost async page write

And calling umount now leads to:

[ 1186.264797] mmcblk1: error -110 sending status command, retrying
[ 1186.271321] mmcblk1: error -110 sending status command, retrying
[ 1186.278199] mmcblk1: error -110 sending status command, aborting
[ 1186.499591] mmc1: tried to reset card, got error -2
[ 1186.504877] blk_update_request: I/O error, dev mmcblk1, sector 8985
[ 1186.511626] blk_update_request: I/O error, dev mmcblk1, sector 8986
[ 1186.518110] blk_update_request: I/O error, dev mmcblk1, sector 8987
[ 1186.524644] blk_update_request: I/O error, dev mmcblk1, sector 8988
[ 1186.531262] blk_update_request: I/O error, dev mmcblk1, sector 8989
[ 1186.537882] blk_update_request: I/O error, dev mmcblk1, sector 8990
[ 1186.544607] blk_update_request: I/O error, dev mmcblk1, sector 8991
[ 1186.551275] blk_update_request: I/O error, dev mmcblk1, sector 8992
[ 1186.557886] blk_update_request: I/O error, dev mmcblk1, sector 8993
[ 1186.564387] blk_update_request: I/O error, dev mmcblk1, sector 8994
[ 1186.573482] mmcblk1: error -110 sending status command, retrying
[ 1186.581017] mmcblk1: error -110 sending status command, retrying
[ 1186.588416] mmcblk1: error -110 sending status command, aborting
[ 1186.604261] mmcblk1: error -110 sending status command, retrying
[ 1186.610768] mmcblk1: error -110 sending status command, retrying
[ 1186.617396] mmcblk1: error -110 sending status command, aborting
[ 1186.629629] mmcblk1: error -110 sending status command, retrying
[ 1186.636213] mmcblk1: error -110 sending status command, retrying
[ 1186.642676] mmcblk1: error -110 sending status command, aborting
[ 1186.648819] FAT-fs (mmcblk1p1): FAT read failed (blocknr 793)

But the card is still visible at /dev/mmcblk1. I’m not sure, but I think that before the patches the card was no longer listed under /dev once the errors happened.
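
For reading logs like the ones above: the kernel prints negated Linux errno values, so -110 is ETIMEDOUT (the card did not answer in time) and -2 is ENOENT. Below is a minimal Python sketch, not part of the attached test script, that tallies these errors per device from saved dmesg output. The function name tally_mmc_errors and the regular expression are illustrative assumptions and only cover the line formats seen in this thread.

```python
import re
from collections import Counter

# Linux errno values that appear in this thread (the kernel prints them negated).
ERRNO_NAMES = {2: "ENOENT", 110: "ETIMEDOUT"}

# Matches lines such as:
#   [  489.378613] mmc1: tried to reset card, got error -110
#   [  488.798065] mmcblk1: error -110 requesting status
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<dev>mmc\w*):.*?error (?P<code>-\d+)"
)


def tally_mmc_errors(dmesg_text):
    """Count (device, errno name) pairs found in dmesg output (hypothetical helper)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # block-layer and filesystem lines are ignored here
        code = -int(match.group("code"))
        name = ERRNO_NAMES.get(code, "errno %d" % code)
        counts[(match.group("dev"), name)] += 1
    return counts
```

It could be fed with the contents of a saved log, for example `tally_mmc_errors(open("dmesg-sd-card.log").read())`, to see whether timeouts dominate and when the first one appears.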

Is the patch from “Jetson Nano SD card enters back to high speed mode instead of uhs mode after soft reboot - #4 by WayneWWW” also applicable to Xavier NX?

If that patch is not for dts, then it can be applied to NX too.

Does this issue happen on both NX and Nano, or on just one platform?

What was this card doing when the temperature rose? Was there any activity on the card?

The issue happens on both NX and Nano.

When the card is not used (e.g. no write or mount), no error occurs.
But any access to the card leads to the error.
As mentioned, we use a simple script that periodically writes to the card to check that it works.
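
For anyone who wants to reproduce this, a periodic write check of that kind could look roughly like the Python sketch below. This is not our actual script; the names write_probe and run_probe_loop, the file name sd_probe.bin, and the mount point /mnt/sdcard in the usage note are assumptions for illustration. The fsync call is the important part, because without it the write may only reach the page cache and no I/O error would surface.

```python
import os
import time


def write_probe(directory, size=4096):
    """Write a file, force it to the device, read it back. Returns (ok, detail)."""
    path = os.path.join(directory, "sd_probe.bin")  # hypothetical file name
    payload = os.urandom(size)
    try:
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data past the page cache to the card
        # Note: this read may be served from the page cache, so it is only a
        # sanity check; the fsync above is what exposes I/O errors.
        with open(path, "rb") as f:
            data = f.read()
        os.remove(path)
    except OSError as exc:
        return False, "%s (errno %s)" % (exc.strerror, exc.errno)
    if data != payload:
        return False, "read-back mismatch"
    return True, "ok"


def run_probe_loop(directory, interval_s=10.0, iterations=3):
    """Run write_probe repeatedly; return a list of (timestamp, ok, detail)."""
    results = []
    for i in range(iterations):
        ok, detail = write_probe(directory)
        results.append((time.time(), ok, detail))
        if not ok:
            break  # stop at the first failure so it is easy to match in dmesg
        if i + 1 < iterations:
            time.sleep(interval_s)
    return results
```

With the card mounted at, say, /mnt/sdcard, `run_probe_loop("/mnt/sdcard", interval_s=10, iterations=1000)` would run until the first failing write.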

So this issue requires both of these conditions?

  1. The device is in an environment above 50°C
  2. Any read/write to this mmc1

If you read/write mmc1 directly in an environment below 50°C, is there no problem?

Correct, we need both a specific temperature level and read/write/mount activity on mmc1.
Please note that we measure the temperature with tegrastats.
The temperature at which the issue appears is about 50°C.

When we cool the SOM with a fan and the temperature reported by tegrastats stays at about 35°C, no issue occurs.
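
To log the temperature next to the card test without parsing tegrastats output, the standard Linux thermal sysfs interface can be read directly; it reports each zone's temperature in millidegrees Celsius. The following is a minimal Python sketch under that assumption. The helper names read_thermal_zones and zones_at_or_above are made up for illustration, and which zone names exist depends on the module and kernel, so they are not hard-coded here.

```python
import glob
import os


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone type: temperature in degrees C} for each readable zone."""
    temps = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                name = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())  # sysfs reports millidegrees C
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or report no number
        temps[name] = millideg / 1000.0
    return temps


def zones_at_or_above(temps, threshold_c=50.0):
    """Return the sorted names of zones at or above threshold_c (hypothetical helper)."""
    return sorted(name for name, value in temps.items() if value >= threshold_c)
```

Calling `read_thermal_zones()` right before each card access and storing the result with a timestamp would show which zone, if any, crosses about 50°C when the first -110 error appears.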