Hello, I am having some eMMC-related issues on my 64GB AGX Orin modules.
The whole system seems to hang, and I can only access the device through the serial port. I have attached the serial log file to this post.
The logs show a CQHCI timeout, which suggests that the driver is timing out while waiting for a response from the hardware.
The driver then runs CQE recovery to recover from the timeout. I think this is meant to handle errors and restore normal operation, but it seems to fail repeatedly.
CBB-Fabric errors: The repeated SLAVE_ERR messages related to cbb-fabric suggest that there are issues with data transactions across the SoC, specifically involving the CCPLEX (which is likely the CPU cluster). These errors may be linked to the MMC timeouts.
When the system is left in this state, it outputs the same errors over serial repeatedly. A full power cycle restores normal operation, but after an unpredictable amount of time the errors return.
This issue is not isolated to a single unit and is now happening on 3 other 64GB AGX Orin modules.
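To compare units, it can help to count how often each kind of error appears in a captured serial log. The following is a minimal sketch, assuming the log is plain text; the patterns are built from the terms used above, and the sample lines are hypothetical, not copied from the attached log.

```python
import re
from collections import Counter

# Patterns built from the terms in this post; adjust them to match the
# exact wording in your own serial log.
PATTERNS = {
    "cqhci_timeout": re.compile(r"cqhci.*timeout", re.IGNORECASE),
    "cqe_recovery": re.compile(r"CQE recovery", re.IGNORECASE),
    "cbb_slave_err": re.compile(r"SLAVE_ERR"),
    "smmu_fault": re.compile(r"arm-smmu.*context fault", re.IGNORECASE),
}


def tally_errors(lines):
    """Count how many lines match each error pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Hypothetical sample lines for illustration only.
SAMPLE = [
    "[  100.000000] mmc0: cqhci: timeout for tag 3",
    "[  100.000100] mmc0: running CQE recovery",
    "[  100.000200] cbb-fabric: SLAVE_ERR",
    "[  100.000300] arm-smmu 8000000.iommu: Unhandled context fault",
    "[  100.000400] mmc0: running CQE recovery",
]

if __name__ == "__main__":
    for name, count in sorted(tally_errors(SAMPLE).items()):
        print(name, count)
```

For a real capture, pass an open file instead of `SAMPLE`, for example `tally_errors(open("serial.log", errors="replace"))`, where `serial.log` is a placeholder file name.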
I will need to find a solution for r35.3.1 first, before I test on r35.6, as we already have a lot of units in the field running JP 5.1.1. Should I still try SivaRamaKrishnaNV's instructions for changing the SCR_CONFIG?
Hi WayneWWW, we do plan to test on rel-35.6. In the meantime, we have also captured some additional logs over serial that appear just before the REGISTER DUMPs occur:
A cache flush error normally means the system had trouble writing cached data to the eMMC. I think the -110 means a timeout has occurred?
The arm-smmu also encountered a context fault, likely meaning that an attempt was made to access an invalid memory address (iova=0x00000200). I have found a post here: arm-smmu-8000000-iommu, but it offers no solution or explanation for the cause of the arm-smmu issue.
Have you seen this error before and do you think it could cause the issue we are seeing?
Just to clarify: this could be a module defect, so I don't think you should worry about anything else at this moment. Put your module into an NV devkit and test it first.
If it has the issue on the NV devkit with the latest BSP, then it could be a hardware issue and you will need an RMA.
The mmc error means it is related to the eMMC, and 110 means a timeout happened, but any kind of error could lead to a timeout.
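As a quick reference for reading such return codes, here is a small sketch that maps a negative kernel error value to its Linux errno name. The table is hard-coded from the generic Linux errno numbering so that the result does not depend on the host OS. Only a few codes are listed, and the helper name is hypothetical.

```python
# A few Linux errno values (generic numbering, as used on arm64).
# Hard-coded on purpose: Python's errno module reflects the host OS,
# which may number these differently from the target.
LINUX_ERRNO = {
    5: "EIO",          # I/O error
    84: "EILSEQ",      # illegal byte sequence
    110: "ETIMEDOUT",  # connection/operation timed out
}


def decode_kernel_error(code):
    """Return the errno name for a kernel return code such as -110."""
    return LINUX_ERRNO.get(abs(code), "unknown")


if __name__ == "__main__":
    print(decode_kernel_error(-110))
```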
The last 2 digits of cbfrsynra indicate the stream ID, which can be looked up in nvidia,tegra234-streamid.h.
This could mean the iova address was accessed without being mapped, or that due to some race condition it was unmapped before the access. I am not sure whether this is related to the original issue.
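The stream ID lookup described above can be sketched as follows. This assumes the fault line contains a `cbfrsynra=0x...` field and that "the last 2 digits" means the low two hex digits. The sample line is hypothetical apart from the iova value quoted earlier in the thread, and the function name is made up for illustration.

```python
import re

# Matches the cbfrsynra field of an arm-smmu context fault line.
CBFRSYNRA_RE = re.compile(r"cbfrsynra=0x([0-9a-fA-F]+)")


def stream_id_from_fault(line):
    """Return the low two hex digits of cbfrsynra, or None if absent."""
    match = CBFRSYNRA_RE.search(line)
    if match is None:
        return None
    value = int(match.group(1), 16)
    return value & 0xFF  # keep only the last two hex digits


# Hypothetical line: only iova=0x00000200 comes from the thread.
SAMPLE = "arm-smmu 8000000.iommu: context fault: iova=0x00000200, cbfrsynra=0x802"

if __name__ == "__main__":
    sid = stream_id_from_fault(SAMPLE)
    print(f"stream id: 0x{sid:02x}")
```

The resulting value still has to be matched by hand against the definitions in nvidia,tegra234-streamid.h to find the client that issued the access.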
Hi WayneWWW, we have tested on our AGX Orin dev kit on JP 6 rev 2 and the issue does not occur there. We think we may have found the cause. The eMMC has changed to a Western Digital SDINBDG4-64GB-1212, as outlined in PCN210100, and we are on JP 5.1.1. Could this be causing our issue? It is only happening on our Orin modules with a 699-level part number ending in 501. Orin modules ending in 500 appear to be unaffected.
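To check which eMMC part a given unit actually carries without reading the module label, the identity fields the kernel exposes in sysfs can be read directly. This is a minimal sketch, assuming the eMMC appears as `mmcblk0` (this may differ on your setup); the function name is hypothetical and any attribute that is missing is reported as `None`.

```python
from pathlib import Path

# Standard MMC identity attributes exposed by the Linux kernel in sysfs.
ATTRS = ("name", "manfid", "oemid", "date", "fwrev")


def read_emmc_identity(device_dir="/sys/block/mmcblk0/device"):
    """Read eMMC identity attributes; mmcblk0 is an assumed device name."""
    base = Path(device_dir)
    info = {}
    for attr in ATTRS:
        path = base / attr
        # Missing attributes are recorded as None rather than raising.
        info[attr] = path.read_text().strip() if path.is_file() else None
    return info


if __name__ == "__main__":
    for key, value in read_emmc_identity().items():
        print(f"{key}: {value}")
```

Running this on one module ending in 500 and one ending in 501 and comparing the output should show whether the two groups really differ in eMMC part.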