eMMC Issues

Hello, I am having some eMMC-related issues on my 64G AGX Orin modules.
The whole system seems to hang, and I am only able to access the device through the serial port. I have attached the serial log file to this post.

The logs indicate a CQHCI timeout, which suggests that the device is timing out while waiting for a response from the hardware.

It then attempts to run CQE recovery to recover from the timeout. I think this is meant to handle errors and restore normal operation, but it seems to be failing repeatedly.

CBB-Fabric Errors: The repeated SLAVE_ERR messages related to cbb-fabric suggest that there are issues with data transactions across the SoC, specifically involving the CCPLEX (which is likely the CPU cluster). These errors may be linked to the MMC timeouts.

When the system is left in this state, it outputs the same errors over serial repeatedly. A full power cycle recovers it to a nominal state, but after an undefined amount of time the errors return.
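For anyone triaging a similar serial log, here is a minimal sketch that counts how often each of the error types described above appears and on which line each first shows up. The regex patterns are assumptions built only from the keywords mentioned in this post (CQHCI timeout, CQE recovery, SLAVE_ERR, arm-smmu context fault); they are not copied from the attached log, so adjust them to the exact strings in your own capture.

```python
import re
from collections import Counter

# Hypothetical signature patterns, based only on keywords mentioned in this
# thread. Adjust them to match the exact strings in your own serial log.
SIGNATURES = {
    "cqhci_timeout": re.compile(r"cqhci.*timeout", re.IGNORECASE),
    "cqe_recovery": re.compile(r"cqe recovery", re.IGNORECASE),
    "cbb_slave_err": re.compile(r"SLAVE_ERR"),
    "smmu_context_fault": re.compile(r"arm-smmu.*context fault", re.IGNORECASE),
}


def tally_signatures(lines):
    """Count matching lines per signature and record the first line number seen."""
    counts = Counter()
    first_seen = {}
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
                # Keep only the earliest line number for each signature.
                first_seen.setdefault(name, lineno)
    return counts, first_seen
```

Passing an open log file (any iterable of lines) to `tally_signatures` returns the counts and first-seen line numbers. Comparing the first-seen line numbers may hint at which error comes first, which is useful when trying to decide whether the fabric errors lead or follow the MMC timeouts.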

This issue is not isolated to a single unit and is now happening on 3 other 64G AGX Orin Modules.

Has anyone experienced this issue before?

556Errors.txt (12.3 KB)

Dear @matt.read,
May I know if the used platform is DRIVE Orin or Jetson Orin?

Hi SivaRamaKrishnaNV,

I am using the Jetson Orin.

Moved the topic to Jetson AGX Orin Forum

Which JetPack release is in use here?

We are using JP 5.1.1 on all our Jetson Orin modules.

Please test if it could be reproduced on rel-35.6 NV devkit.

I will need to try to find a solution for r35.3.1 first before I test on r35.6, as we already have a lot of units in the field running JP 5.1.1. Shall I still try SivaRamaKrishnaNV's instructions for changing the SCR_CONFIG?

I think the post was removed?

Siva was talking about some solution on DRIVE AGX but not Jetson.

As this was confirmed as Jetson issue, please upgrade to rel-35.6 first to test.

We don’t check issues directly on old releases. Always use the latest release + NV devkit to check the situation first.

Hi WayneWWW, we do plan to test on rel-35.6. In the meantime, we have also caught some additional logs over serial that appear just before the REGISTER DUMPs occur:

A cache flush error normally means the system had trouble writing cached data to the eMMC. The -110 I think means a timeout has occurred?
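As a quick sanity check on the -110, here is a small sketch that maps kernel-style negative return codes to their Linux errno names. The table is a hand-picked subset of the generic Linux errno values, hardcoded so the lookup gives the same answer on any host OS; the function name is just for illustration.

```python
# Subset of the generic Linux errno values. Kernel drivers return these
# negated, so a "-110" in dmesg corresponds to errno 110.
LINUX_ERRNO = {
    5: ("EIO", "I/O error"),
    12: ("ENOMEM", "Out of memory"),
    16: ("EBUSY", "Device or resource busy"),
    19: ("ENODEV", "No such device"),
    22: ("EINVAL", "Invalid argument"),
    84: ("EILSEQ", "Illegal byte sequence"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_kernel_error(code):
    """Return (name, description) for a kernel-style return code such as -110."""
    return LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))


print(decode_kernel_error(-110))
```

So -110 is ETIMEDOUT. That only says the operation timed out, not why it did.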

The arm-smmu also encountered a context fault, likely meaning that an attempt was made to access an invalid memory address (iova=0x00000200). I have found a post here: arm-smmu-8000000-iommu, but it gives no solution or explanation of the cause of the arm-smmu issue.

Have you seen this error before and do you think it could cause the issue we are seeing?

What is the 8000000.iommu referring to?

Thanks in advance.

Hi,

Just to clarify: this could be a module defect, so I don’t think you should worry about anything else at this moment. Put your module into an NV devkit and test it first.

If it has the issue on an NV devkit with the latest BSP, then it could be a hardware issue and you will need an RMA.

The mmc issue means it is related to eMMC, and -110 means a timeout happened, but any kind of error could lead to a timeout.

The last 2 digits of cbfrsynra indicate the stream ID, which can be looked up in nvidia,tegra234-streamid.h.
This could mean the iova address was accessed without a mapping, or that due to some race condition it got unmapped before the access. Not sure if this is related to the original issue.
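The stream ID lookup described above can be sketched as follows. This assumes the fault line carries the field as `cbfrsynra=0x<hex>` and, following the description in this reply, treats the last two hex digits (the low byte) as the stream ID. The function name is hypothetical, and the resulting value still has to be matched by hand against the stream ID header.

```python
import re

# Assumed field format: "cbfrsynra=0x<hex digits>" somewhere in the fault line.
CBFRSYNRA_RE = re.compile(r"cbfrsynra=0x([0-9a-fA-F]+)")


def stream_id_from_fault(line):
    """Return the low byte of cbfrsynra, or None if the field is absent."""
    match = CBFRSYNRA_RE.search(line)
    if match is None:
        return None
    # "Last 2 digits" of the hex value is the same as masking with 0xFF.
    return int(match.group(1), 16) & 0xFF
```

For a line containing `cbfrsynra=0x802` (a made-up value, not taken from the attached log), the function returns 2, which would then be compared with the stream ID definitions in the header.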

Hi WayneWWW, we have tested on our AGX Orin dev kit on JP 6 rev 2 and the issue does not occur. We think we may have found the cause. The eMMC has changed to a Western Digital SDINBDG4-64GB-1212 eMMC, as outlined in PCN210100, and we are on JP 5.1.1. Could this be causing our issue? It is only happening on our Orin modules with a 699-level part number ending in 501. Orin modules ending in 500 appear to be unaffected.

Hi,

Yes, it is possible, but it is most likely due to the DRAM rather than the eMMC itself. The PCN updates both eMMC and DRAM.

Hi WayneWWW, I have just looked over PCN210100. It looks like the DRAM changes are only for the 32G Orin and not the 64G Orin?

We are using the 64G.

Are you saying the 64G also has the new SK Hynix DRAM?

The software patch for the PCN is actually only for DRAM. The eMMC uses a generic driver, which doesn’t change just because you change from eMMC A to eMMC B.

Hi @WayneWWW, but the PCN states that BSP changes are needed for the 64G Orin to accommodate the WD eMMC change.

Is the PCN incorrect?

Then please just upgrade the BSP first.

I didn’t look into it, so maybe I am wrong. I was just describing the general case based on my past experience.
