Test environment: JetPack 5.2.1
Test module: Jetson AGX Orin 64GB
Test command: ./memtester 30G 5
While running a memory test with the “memtester” tool, an error occurred during the fifth test pass. The kernel reported an "EMEM address decode error". The relevant kernel log entries are shown below.
Jul 9 16:13:17 ubuntu kernel: [17988.711552] NVRM nvAssertFailedNoLog: Assertion failed: minRequiredIsoBandwidthKBPS <= clientBwValues[DISPLAY_ICC_BW_CLIENT_EXT].minRequiredIsoBandwidthKBPS @ kern_disp_0402.c:111
Jul 9 16:13:55 ubuntu kernel: [18026.185693] NVRM nvAssertFailedNoLog: Assertion failed: minRequiredIsoBandwidthKBPS <= clientBwValues[DISPLAY_ICC_BW_CLIENT_EXT].minRequiredIsoBandwidthKBPS @ kern_disp_0402.c:111
Jul 9 16:14:00 ubuntu kernel: [18031.447784] NVRM nvAssertFailedNoLog: Assertion failed: minRequiredIsoBandwidthKBPS <= clientBwValues[DISPLAY_ICC_BW_CLIENT_EXT].minRequiredIsoBandwidthKBPS @ kern_disp_0402.c:111
Jul 10 04:25:22 ubuntu kernel: [61913.427605] mc-err: (255) csr_nvl5r: EMEM address decode error
Jul 10 04:25:22 ubuntu kernel: [61913.427618] mc-err: status = 0x20044014; hi_addr_reg = 0x00000000 addr = 0x000b6100
Jul 10 04:25:22 ubuntu kernel: [61913.427624] mc-err: secure: no, access-type: read
Jul 10 04:25:22 ubuntu kernel: [61913.427657] mc-err: (255) csr_nvl5r: EMEM address decode error
Jul 10 04:25:22 ubuntu kernel: [61913.427660] mc-err: status = 0x20000014; hi_addr_reg = 0x00000000 addr = 0x00030000
Jul 10 04:25:22 ubuntu kernel: [61913.427663] mc-err: secure: no, access-type: read
The test screenshot is as follows
We would like to know the cause of this error.
The kernel log is shown in the attachment.
kern.log (1.0 MB)
*** Please note that this reply is generated by LLM automatically ***
Based on the provided context, I will attempt to provide a comprehensive answer to the question.
The issue at hand is related to a memory testing error that occurred on a Jetson AGX Orin module with 64GB of RAM, running Jetpack 5.2.1. The error occurred during the fifth test using the “memtester” tool, with a test command of “./memtester 30G 5”.
To analyze this issue, let’s break down the relevant information:
- Test Environment: Jetpack 5.2.1, Jetson AGX Orin module with 64GB of RAM.
- Test Command: “./memtester 30G 5”, which tests 30GB of memory with 5 iterations.
- Error Message: Not explicitly provided, but a screenshot of the error message is attached.
- Kernel Log: Attached as a file named “kern.log” (1.0 MB).
Given the limited information, I will provide some possible causes and comparisons:
Possible Causes:
- Memory Corruption: The error might be related to memory corruption, which could be caused by a hardware or software issue.
- Kernel Bug: A kernel bug could be causing the memory testing error. The kernel log might provide more information about the error.
- Hardware Issue: A hardware issue with the Jetson AGX Orin module or the RAM could be causing the error.
Comparison with Similar Issues:
- Nvmap Error: There is a similar issue reported on the Jetson Orin NX, which is related to an Nvmap error. Although the error messages are different, it’s possible that there is a common underlying cause.
- Vk_error_device_lost: Another issue reported on the Linux kernel is related to a Vk_error_device_lost error, which might be related to a kernel bug or a hardware issue.
Recommendations:
- Check the Kernel Log: Analyze the kernel log to see if there are any error messages that can provide more information about the cause of the issue.
- Run Additional Tests: Run additional memory tests to see if the issue is reproducible and to gather more information about the error.
- Check for Firmware Updates: Check if there are any firmware updates available for the Jetson AGX Orin module, as this might resolve the issue.
- Contact NVIDIA Support: If the issue persists, contact NVIDIA support for further assistance, as they may be able to provide more detailed guidance or repair/replace the module if it’s a hardware issue.
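As a starting point for the "Check the Kernel Log" step, here is a minimal Python sketch that pulls the mc-err records out of a kernel log and counts them per client. It assumes the three-line record layout seen in the excerpts in this thread (a header line, a status line, and an access line). The function name `parse_mc_errors` and the `SAMPLE` text are illustrative only and are not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Patterns for the three-line mc-err record layout seen in this thread.
# Assumption: other kernel versions may format these lines differently.
HEADER = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] mc-err: \(\d+\) (?P<client>\w+): (?P<error>.+)$"
)
STATUS = re.compile(
    r"mc-err: status = (?P<status>0x[0-9a-fA-F]+); "
    r"hi_addr_reg = (?P<hi>0x[0-9a-fA-F]+) addr = (?P<addr>0x[0-9a-fA-F]+)"
)
ACCESS = re.compile(r"mc-err: secure: (?P<secure>\w+), access-type: (?P<access>\w+)")


def parse_mc_errors(lines):
    """Return one dict per mc-err record found in an iterable of log lines."""
    records = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        m = HEADER.search(line)
        if m:
            # A header line starts a new record.
            current = {
                "ts": float(m["ts"]),
                "client": m["client"],
                "error": m["error"].strip(),
            }
            records.append(current)
            continue
        if current is None:
            continue
        m = STATUS.search(line)
        if m:
            current.update(status=m["status"], hi_addr_reg=m["hi"], addr=m["addr"])
            continue
        m = ACCESS.search(line)
        if m:
            current.update(secure=m["secure"], access=m["access"])
    return records


# Sample lines copied from the first post in this thread.
SAMPLE = """\
Jul 10 04:25:22 ubuntu kernel: [61913.427605] mc-err: (255) csr_nvl5r: EMEM address decode error
Jul 10 04:25:22 ubuntu kernel: [61913.427618] mc-err: status = 0x20044014; hi_addr_reg = 0x00000000 addr = 0x000b6100
Jul 10 04:25:22 ubuntu kernel: [61913.427624] mc-err: secure: no, access-type: read
Jul 10 04:25:22 ubuntu kernel: [61913.427657] mc-err: (255) csr_nvl5r: EMEM address decode error
Jul 10 04:25:22 ubuntu kernel: [61913.427660] mc-err: status = 0x20000014; hi_addr_reg = 0x00000000 addr = 0x00030000
Jul 10 04:25:22 ubuntu kernel: [61913.427663] mc-err: secure: no, access-type: read
"""

records = parse_mc_errors(SAMPLE.splitlines())
per_client = Counter((r["client"], r["error"]) for r in records)
for (client, error), count in per_client.items():
    print(f"{client}: {error} x{count}")
```

To run it against the attached log, replace `SAMPLE.splitlines()` with an open file handle, for example `open("kern.log", errors="replace")`. Grouping by client makes it easier to see whether the errors come from a single memory client or from several.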
In summary, the error message and kernel log are not explicitly provided, but based on the test environment and command, it’s possible that the issue is related to memory corruption, a kernel bug, or a hardware issue. Further analysis and testing are required to determine the root cause of the issue.
Could you give us some direction for analyzing this issue?
Hello, we also found such errors in the following test.
Test method:
One network port on each of the two i350 controllers was directly connected, and the Wi-Fi was connected to the same router. iperf3 tests were then run simultaneously on the network ports of both i350 controllers and on the Wi-Fi interface. Memory errors occurred on both controllers.
We did not encounter such problems when using Jetpack 6.0 on the same hardware platform. However, such problems occurred on Jetpack 5.2.1. Can you analyze what the anomaly might be?
Hello, we have also reproduced this issue on the devkit. Our test environment is as follows:
JetPack: 5.1.2 (specified by the customer; cannot be changed)
Equipment: NVIDIA Jetson AGX Orin Developer Kit.
Image: built with the real-time kernel enabled, following the official documentation.
logs:
ubuntu login: [ 3890.786414] arm-smmu 8000000.iommu: Unhandled context fault: fsr=0x80000402, iova=0xff861800, fsynr=0x730003, cbfrsynra=0x805, cb=4
[ 3890.799300] mc-err: (255) csr_pcie1r: EMEM address decode error
[ 3890.805528] mc-err: status = 0x200640da; hi_addr_reg = 0x000000ff addr = 0xffffffff00
[ 3890.813910] mc-err: secure: yes, access-type: read
[ 6851.754380] arm-smmu 8000000.iommu: Unhandled context fault: fsr=0x80000402, iova=0xff9fc800, fsynr=0x730003,cbfrsynra=0x5, cb=4
[ 6851.766975] mc-err: (255) csr_pcie1r: EMEM address decode error
[ 6851.773234] mc-err: status = 0x200640da; hi_addr_reg = 0x000000ff addr = 0xffffffff00
[ 6851.781505] mc-err: secure: yes, access-type: read
[ 7229.742960] arm-smmu 8000000.iommu: Unhandled context fault: fsr=0x80000402, iova=0xff625800, fsynr=0xc0003, cbfrsynra=0xc05, cb=4
[ 7229.760309] mc-err: (255) csr_pcie1r: EMEM address decode error
[ 7229.766488] mc-err: status = 0x200640da; hi_addr_reg = 0x000000ff addr = 0xffffffff00
[ 7229.774774] mc-err: secure: yes, access-type: read
[16931.786458] arm-smmu 8000000.iommu: Unhandled context fault: fsr=0x80000402, iova=0xff7b0800, fsynr=0xc0003, cbfrsynra=0x405, cb=4
[16931.803621] mc-err: (255) csr_pcie1r: EMEM address decode error
[16931.810035] mc-err: status = 0x200640da; hi_addr_reg = 0x000000ff addr = 0xffffffff00
[16931.818369] mc-err: secure: yes, access-type: read
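In the log above, every mc-err record follows an arm-smmu context fault by a few milliseconds. Below is a minimal Python sketch for pairing the two event types across a longer log, so the iova and cbfrsynra values can be compared between occurrences. The 0.1 s pairing window and the name `pair_faults` are assumptions made for illustration. Closeness in time does not by itself prove that the two messages share a cause.

```python
import re

# arm-smmu context fault line; tolerates a missing space after commas,
# as in the second fault of the log above.
SMMU = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] arm-smmu \S+ Unhandled context fault: "
    r"fsr=(?P<fsr>0x[0-9a-f]+),\s*iova=(?P<iova>0x[0-9a-f]+),\s*"
    r"fsynr=(?P<fsynr>0x[0-9a-f]+),\s*cbfrsynra=(?P<cbfrsynra>0x[0-9a-f]+),\s*"
    r"cb=(?P<cb>\d+)"
)
# mc-err header line (first line of each three-line record).
MC = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\] mc-err: \(\d+\) (?P<client>\w+): ")


def pair_faults(lines, window_s=0.1):
    """Pair each SMMU fault with the next mc-err header within window_s seconds."""
    pairs = []
    pending = None
    for line in lines:
        m = SMMU.search(line)
        if m:
            pending = m.groupdict()
            continue
        m = MC.search(line)
        if m and pending is not None:
            delta = float(m["ts"]) - float(pending["ts"])
            if 0 <= delta <= window_s:
                pairs.append((pending, m["client"], delta))
            pending = None
    return pairs


# Sample lines copied from the log above.
SAMPLE = """\
ubuntu login: [ 3890.786414] arm-smmu 8000000.iommu: Unhandled context fault: fsr=0x80000402, iova=0xff861800, fsynr=0x730003, cbfrsynra=0x805, cb=4
[ 3890.799300] mc-err: (255) csr_pcie1r: EMEM address decode error
[ 3890.805528] mc-err: status = 0x200640da; hi_addr_reg = 0x000000ff addr = 0xffffffff00
[ 3890.813910] mc-err: secure: yes, access-type: read
[ 6851.754380] arm-smmu 8000000.iommu: Unhandled context fault: fsr=0x80000402, iova=0xff9fc800, fsynr=0x730003,cbfrsynra=0x5, cb=4
[ 6851.766975] mc-err: (255) csr_pcie1r: EMEM address decode error
[ 6851.773234] mc-err: status = 0x200640da; hi_addr_reg = 0x000000ff addr = 0xffffffff00
[ 6851.781505] mc-err: secure: yes, access-type: read
"""

pairs = pair_faults(SAMPLE.splitlines())
for fault, client, delta in pairs:
    print(
        f"iova={fault['iova']} cbfrsynra={fault['cbfrsynra']} cb={fault['cb']} "
        f"-> {client} after {delta * 1000:.1f} ms"
    )
```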