I’ve been looking into ARMv8.2’s RAS protocols, the L2/L3 ECC protections, and general system monitoring for health and error trapping. The ARM documentation suggests that all of the RAS and ECC performance reporting is implementation dependent. I’ve been looking around but haven’t found any information detailing the Nvidia implementation. Does the Xavier / Carmel implementation include RAS? Has this been implemented in Tegra?
I’d greatly appreciate a pointer to any available documentation or implementation examples.
Tegra has RAS implementations from TX2 onwards.
As an example, you can see the error spew when you try to access a location that is unmapped or that you lack the privilege to access.
I don’t think we have released any specific documentation apart from what is mentioned in the TRM.
You can see the implementation here: drivers/ras/arm64_ras.c
If you are looking for something specific, let us know.
Thanks, Bibek. I have found this driver and will look into what it instruments and how we can use it. I am most interested in hardware faults, and I would like to be able to induce errors and see them reported. For instance, the Linux EDAC module forums describe heating up memory DIMMs with a heat gun and watching the error counts tick up. I would like to do the same.
I will get back to you when we have more specific questions.
I’m working on this same subject with Sam. In particular, we’re looking to monitor the Corrected Error Counters (CEC) of the Carmel CPUs. According to the ARM Architecture Reference Manual D13.7, the CEC field is found in the ERXMISC0_EL1 register. We’re reading the ERRIDR_EL1 error ID register to find the number of error records available to read. We’re also reading the ERXFR_EL1 error feature register for each error record, using ERRSELR_EL1 to step through the records.
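For anyone following along, here is a rough sketch of the record walk described above: read ERRIDR_EL1, take the NUM field (bits [15:0]) as the record count, then select each record through ERRSELR_EL1 and read ERXFR_EL1. The system registers are only accessible from privileged code, so the accessor table (struct ras_regs) and the simulated backend (struct ras_sim, ras_sim_regs) are hypothetical names made up for this illustration so the loop can be tried off-target. This is not the code in the Tegra driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessor table. In kernel code these would wrap MRS/MSR
 * accesses to ERRIDR_EL1, ERRSELR_EL1 and ERXFR_EL1. */
struct ras_regs {
    uint64_t (*read_erridr)(void *ctx);
    void (*write_errselr)(void *ctx, uint64_t sel);
    uint64_t (*read_erxfr)(void *ctx);
    void *ctx;
};

/* ERRIDR_EL1.NUM, bits [15:0]: highest accessible record index plus one. */
static unsigned ras_num_records(uint64_t erridr)
{
    return (unsigned)(erridr & 0xFFFFu);
}

/* Select each record in turn and store its feature register in out[].
 * Returns the number of records read, clamped to max. */
static size_t ras_read_all_fr(const struct ras_regs *r, uint64_t *out, size_t max)
{
    size_t n = ras_num_records(r->read_erridr(r->ctx));

    if (n > max)
        n = max;
    for (size_t i = 0; i < n; i++) {
        r->write_errselr(r->ctx, i);
        /* On real hardware an ISB is typically needed here so the new
         * selection is visible before the ERX* register is read. */
        out[i] = r->read_erxfr(r->ctx);
    }
    return n;
}

/* Simulated register file (assumption, for off-target experiments only). */
struct ras_sim {
    uint64_t erridr;
    uint64_t errselr;
    const uint64_t *fr; /* one feature register value per record */
};

static uint64_t sim_read_erridr(void *ctx)
{
    return ((struct ras_sim *)ctx)->erridr;
}

static void sim_write_errselr(void *ctx, uint64_t sel)
{
    ((struct ras_sim *)ctx)->errselr = sel;
}

static uint64_t sim_read_erxfr(void *ctx)
{
    struct ras_sim *s = ctx;
    return s->fr[s->errselr];
}

static struct ras_regs ras_sim_regs(struct ras_sim *s)
{
    struct ras_regs r = { sim_read_erridr, sim_write_errselr, sim_read_erxfr, s };
    return r;
}
```

To try it, fill a struct ras_sim with a record count and an array of feature register values, build the accessor table with ras_sim_regs(), and call ras_read_all_fr(). On target, the same loop would run with accessors that issue the real register reads and writes.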
Can you provide information on which bit fields in ERXMISC0_EL1 (or elsewhere) contain the Corrected Error Counters, and how to map the error record number to the physical CPU ECC element (e.g., L1, L2, or L3)?
Hardware corrected error counters (CEC) are not supported on Xavier/T194.
ERRFR.CEC, bits [14:12] = 0x0 : [0b000 - Does not implement the standard Corrected error counter model]
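A minimal sketch of that check, assuming you already have the 64-bit feature register value for a record (for example from ERXFR_EL1). The macro and helper names are made up for illustration. Only the field position, bits [14:12], and the meaning of 0b000 come from the statement above.

```c
#include <stdint.h>

/* ERR<n>FR.CEC occupies bits [14:12] of the feature register. */
#define ERRFR_CEC_SHIFT 12
#define ERRFR_CEC_MASK  0x7u

/* Extract the raw 3-bit CEC field. */
static unsigned errfr_cec(uint64_t errfr)
{
    return (unsigned)((errfr >> ERRFR_CEC_SHIFT) & ERRFR_CEC_MASK);
}

/* Non-zero when the record reports some standard corrected error counter;
 * 0b000 means the standard counter model is not implemented. */
static int errfr_has_std_cec(uint64_t errfr)
{
    return errfr_cec(errfr) != 0;
}
```

On Xavier/T194, errfr_has_std_cec() would be expected to return 0 for every record, per the value quoted above.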
To map the error record number to the physical CPU unit:
Please refer to the ras_mca_get_record_errselr() function in the file “drivers/platform/tegra/carmel_ras.c”.
Thank you for the information. We’ll take a look at the ras_mca_get_record_errselr() function to help identify the source of each error record. Since Xavier does not implement the ARM corrected error counter (CEC), does it provide another indicator that could inform us when an ARM RAS error has been corrected?
An FHI (Fault Handling Interrupt) is generated for correctable errors. The FHI handler currently prints details when a correctable error occurs. You can add more logic to the FHI handler, as your requirements dictate, to be notified of such errors.
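As one possible shape for that extra logic, here is a rough sketch of a per-record software tally that an FHI handler could update. It assumes the record's status value follows the architectural ERR<n>STATUS layout (V in bit 30, CE in bits [25:24]); whether every Carmel record follows that layout is an assumption to check against the driver. The names ce_stats, ce_stats_note and RAS_MAX_RECORDS are hypothetical and do not exist in the Tegra sources.

```c
#include <stdint.h>

#define RAS_MAX_RECORDS 64 /* hypothetical upper bound on record indices */

/* Architectural ERR<n>STATUS fields (assumed to apply to the record). */
#define ERRSTATUS_V        (1ull << 30) /* status register is valid */
#define ERRSTATUS_CE_SHIFT 24           /* CE field, bits [25:24] */
#define ERRSTATUS_CE_MASK  0x3ull

/* Software tally of corrected-error notifications, one slot per record. */
struct ce_stats {
    uint64_t count[RAS_MAX_RECORDS];
};

/* Call once per FHI with the record index and its status value.
 * Returns 1 if a corrected error was tallied, 0 otherwise. */
static int ce_stats_note(struct ce_stats *s, unsigned record, uint64_t status)
{
    if (record >= RAS_MAX_RECORDS)
        return 0; /* index out of range for this table */
    if (!(status & ERRSTATUS_V))
        return 0; /* nothing valid is recorded */
    if (((status >> ERRSTATUS_CE_SHIFT) & ERRSTATUS_CE_MASK) == 0)
        return 0; /* no corrected error reported */
    s->count[record]++;
    return 1;
}
```

Note that this counts interrupts that report a corrected error, not individual corrections, so it is not a substitute for a hardware CEC. Real handler code would also need locking or per-CPU storage, which is left out here.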