Fault Detection/Reaction - On Die Temperature Sensor

Hi NVIDIA!

We’re working through our failure cases and need some insight into the AGX Xavier SoC. For the on-die temperature sensors, are there any sensor fault cases that can be monitored or read? If a temperature sensor fails, does the SoC take any action in response? Please let us know when you can.

Kind Regards,
Matt

Hi,
Please check the Thermal Management section in this document:
https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/power_management_jetson_xavier.html#wwpID0E0LM0HA

If there is a specific condition you may hit that the current mechanism does not seem to cover, please share the details so that we can check and suggest next steps.

Hey DaneLLL! Thanks for the information. We’re specifically asking whether there is a failure mode and reaction mechanism for the internal on-die sensors described in the thermal sensing section of this document.

Example case: the sensor with ABI name “THERMAL_ZONE_AUX” is no longer reporting, has a fault, or is reporting absurd temperature data. Is there a method of detecting these anomalies? Does the SoC take any specific action in response to such an anomaly? Have you encountered this type of anomaly over the lifetime of this SoC?

Please advise when you can.

Thanks,

Matt

Hi,
By default, the chip/module is verified and passes stress tests, so this error case should not happen. If you are concerned about it, you would need to add an additional hardware mechanism or thermal IC for further protection.
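
For the software side of the example case above (a zone that stops reporting or returns absurd values), a userspace plausibility check is one possible complement. The following is only a minimal sketch, not an NVIDIA-provided mechanism. It assumes the standard Linux thermal sysfs layout (`/sys/class/thermal/thermal_zone*/type` and `temp`, with `temp` in millidegrees Celsius). The plausibility limits and the function names `check_zone` and `scan_zones` are hypothetical and would need tuning for a real design. It says nothing about how the SoC itself reacts to a sensor fault.

```python
import glob
import os

# Plausibility window in degrees Celsius. Hypothetical limits, tune per design.
MIN_PLAUSIBLE_C = -40.0
MAX_PLAUSIBLE_C = 125.0


def check_zone(zone_dir):
    """Return (zone_type, status, temp_c) for one thermal zone directory."""
    try:
        with open(os.path.join(zone_dir, "type")) as f:
            zone_type = f.read().strip()
    except OSError:
        # Fall back to the directory name if the type file cannot be read.
        zone_type = os.path.basename(zone_dir)
    try:
        with open(os.path.join(zone_dir, "temp")) as f:
            temp_c = int(f.read().strip()) / 1000.0  # millidegrees C -> degrees C
    except (OSError, ValueError):
        # Failed read or non-numeric content: treat as "sensor not reporting".
        return zone_type, "unreadable", None
    if not (MIN_PLAUSIBLE_C <= temp_c <= MAX_PLAUSIBLE_C):
        return zone_type, "implausible", temp_c
    return zone_type, "ok", temp_c


def scan_zones(base="/sys/class/thermal"):
    """Check every thermal_zone* directory under base."""
    zones = sorted(glob.glob(os.path.join(base, "thermal_zone*")))
    return [check_zone(z) for z in zones]


if __name__ == "__main__":
    for zone_type, status, temp_c in scan_zones():
        print(zone_type, status, temp_c)
```

A supervisor process could call `scan_zones()` periodically and escalate on any status other than "ok". A check like this shares the failure domain of the SoC and the OS, so it does not replace the external thermal IC suggested above.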