Frequency Scaling not taking place on TX2 under load at high ambient temperature

Hello,

TL;DR:
Under load at high ambient temperature in nvpmodel -m 1, the TX2 fails to trip any of the emergency-balanced, gpu-balanced or cpu-balanced frequency-scaling trip points. It is pulled into reset without any of the trip points in the thermal zones being triggered. Obviously this is undesired behaviour.

Details:

I’m characterizing the thermal performance of a system we’re developing that utilizes the TX2. One of our specifications calls for the unit to function at 74 °C ambient. I’ve now tested at least 4 TX2s in a thermal chamber operating under load.

The load test I utilize maxes out all CPUs, the GPU, the encoder, the decoder and has EMC greater than 30%. It is the following:

  • 4 x opencv_perf_cudafeatures2d
  • 5 x opencv_perf_calib3d
  • 1 x gstreamer transcode
    As each process completes another process is spawned to keep the load even.
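
The respawn behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual test harness: the helper names `respawn` and `start_slots` and the restart limit are assumptions introduced here, and the gstreamer transcode pipeline is not given in the post, so it is left out.

```shell
#!/bin/bash
# Sketch: keep a fixed number of copies of a command running.
# respawn and start_slots are hypothetical helper names.

# Run a command repeatedly. LIMIT bounds the number of runs; 0 means forever.
respawn() {
    local limit=$1; shift
    local runs=0
    while [ "$limit" -eq 0 ] || [ "$runs" -lt "$limit" ]; do
        "$@" || true            # ignore the exit status and start again
        runs=$((runs + 1))
    done
}

# Start COUNT background slots, each respawning the same command.
start_slots() {
    local count=$1 limit=$2; shift 2
    local i=0
    while [ "$i" -lt "$count" ]; do
        respawn "$limit" "$@" &
        i=$((i + 1))
    done
}
```

With these helpers, something like `start_slots 4 0 opencv_perf_cudafeatures2d; start_slots 5 0 opencv_perf_calib3d; wait` would keep four and five copies running respectively until interrupted.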

I reviewed the thermal zones and trip points defined in the DTS and monitored BCPU-therm, MCPU-therm, GPU-therm, PLL-therm, AO-therm, PMIC-Die and thermal-fan-est. At no point during my testing was any of the trip points for emergency-balanced, cpu-balanced or gpu-balanced triggered. I have been running with the TX1 profile (i.e. nvpmodel -m 1).

I have been able to successfully run a unit under this load at an ambient temperature of 73 °C for over 12 hrs. I have also confirmed in each case that all thermals have stabilized.

When I increased the ambient above 73 °C, however, the TX2 suddenly went into reset. None of the trip points for frequency scaling had been triggered. What is triggering the system reset (clearly not the kernel)? Have you seen this in thermal testing at Nvidia?

I need to find a solution to ensure the system continues to operate (albeit at much reduced performance) through extended periods at 74 °C.

The temperatures I read prior to the reset (in millidegrees Celsius, as sysfs reports them) were:
BCPU-therm 89000
MCPU-therm 89000
GPU-therm 95500
PLL-therm 89000
AO-therm 93500
PMIC-Die 100000
thermal-fan-est 87500
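
Readings like these can be collected with a small loop over the thermal zones. The sketch below is only an illustration: `dump_zones` is a hypothetical helper name, and the optional base-directory argument is an assumption added so the function can be pointed at a copy of the sysfs tree. Zones whose `temp` file cannot be read are reported as `n/a` rather than aborting the scan.

```shell
#!/bin/bash
# Sketch: print "<type> <temp in millidegrees C>" for every thermal zone.
# dump_zones is a hypothetical helper; the base directory defaults to sysfs.
dump_zones() {
    local base=${1:-/sys/class/thermal}
    local zone type temp
    for zone in "$base"/thermal_zone*; do
        [ -d "$zone" ] || continue
        type=$(cat "$zone/type" 2>/dev/null) || type="unknown"
        # A zone that fails to read is reported as n/a instead of stopping.
        temp=$(cat "$zone/temp" 2>/dev/null) || temp="n/a"
        printf '%s %s\n' "$type" "$temp"
    done
}
```

Calling `dump_zones` once a second and appending the output to a log file gives a record of the last readings before a reset, provided the log is flushed to storage in time.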

Hi,
There is a hardware thermal shutdown driven by the TMP451 thermal sensor on the module. Can you check whether it is configured correctly in the following file?
/hardware/nvidia/platform/t18x/common/kernel-dts/t18x-common-platforms/tegra186-quill-thermal-dtsi.
Searching for TMP451 in https://developer.nvidia.com/embedded/dlc/l4t-documentation-28-1 will turn up more details.

Hi Jim,

Thanks for the reference; I will review it in detail. I had previously read the L4T developer guide for doing BSP customization but found it largely incomplete. I hadn’t even considered using it when reviewing thermal management schemes.

I left out Tboard_tegra and Tdiode_tegra because I don’t get readings from them. Reading /sys/class/thermal/thermal_zone5/temp results in “temp: Invalid argument”. Is there any reason why this might be the case? My DTS is based on the standard quill DTS for the devkit. I haven’t modified any of the bits in tegra186-quill-thermal-dtsi.

Thanks.

Hello,

I am still seeing failures when the BCPU and MCPU temperatures approach 89000. I see no indications of triggers in Linux. Neither Tboard_tegra nor Tdiode_tegra reports a temperature. Frequency scaling does occur if I manually lower the trip points for cpu-balanced and gpu-balanced scaling on thermal_zone0, thermal_zone1 and thermal_zone4. Can you indicate what could be triggering the hardware reset, what the actual limits are and how to confirm the source of the trigger? Based on my observations I would recommend changing the default trip points provided in the L4T (I will customize our own DTS to resolve this issue).
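
The post does not say how the trip points were lowered for this experiment. One possible way, sketched below, is to write to the `trip_point_<n>_temp` file of a zone. This assumes the kernel exposes that file as writable, which depends on the kernel configuration and the zone definition, and it needs root on a real system. `set_trip_temp` is a hypothetical helper name, and the trip index and temperature in the usage note are illustrative values only.

```shell
#!/bin/bash
# Sketch: write a new temperature (millidegrees C) to a trip point and
# read it back. Assumes trip_point_<n>_temp is writable on this kernel.
# set_trip_temp is a hypothetical helper name.
set_trip_temp() {
    local zone_dir=$1 idx=$2 new_temp=$3
    local file="$zone_dir/trip_point_${idx}_temp"
    if [ ! -w "$file" ]; then
        echo "not writable: $file" >&2
        return 1
    fi
    echo "$new_temp" > "$file" || return 1
    cat "$file"     # print the value the kernel now reports
}
```

For example, `set_trip_temp /sys/class/thermal/thermal_zone0 1 85000` would try to lower trip point 1 of zone 0 to 85 °C. A change made this way does not survive a reboot, so a permanent fix still belongs in the DTS.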

Hi dthompson,

For the previous question, could you provide the boot log, the output of this thermal.sh script and the following debugfs entries?

#!/bin/bash
# thermal.sh: summarize thermal zones, trip points and cooling devices
balanced="cpu-balanced"
throt="/sys/kernel/debug/tegra_throttle/${balanced}"

show_therm_entry ()
{
    tn=$1
    tt=$(cat trip_point_${tn}_temp)
    if [[ $tm -gt $tt ]]
	then
	echo -n ">"
    else
	echo -n " "
    fi
    echo -n "n/a	trip_point_${tn}	 $tt	"
    ty=$(cat trip_point_${tn}_type)
    ok[$tn]=0
    echo "${ty:0:6}	n/a	<unknown>"
}

show_cd_therm_entry ()
{
    en=$1
    cn=${en:4:5}
    tn=$(cat cdev${cn}_trip_point)
    echo -n "cdev${cn}	"
    echo -n "trip_point_${tn}	"
    tt=$(cat trip_point_${tn}_temp)
    if [[ $tm -gt $tt ]]
	then
	echo -n ">"
    else
	echo -n " "
    fi
    echo -n "$tt	"
    ty=$(cat trip_point_${tn}_type)
    ok[$tn]=0
    echo -n "${ty:0:6}	"
    echo -n "$(cat cdev${cn}/cur_state) / $(cat cdev${cn}/max_state)	"
    if [ "$(cat cdev${cn}/type)" == $balanced ]
	then
	lim=$(cat $throt | grep "\[$(cat cdev${cn}/cur_state)\]")
	echo "$(cat cdev${cn}/type) $lim"
    else
	echo "$(cat cdev${cn}/type)"
    fi
}

show_therm_table ()
{
    bw=$(pwd)
    cd $1
    tm=$(cat temp)
    ty=$(cat type)
    echo "${1:19:28}  \"$ty\"	temp: $tm mC"
    echo "  policy: $(cat policy) / { $(cat available_policies)}"
    echo "    passive_delay: $(cat passive_delay)     polling_delay: $(cat polling_delay)"
    if [[ ! -d cdev0 ]]
	then
	echo "------------------------------------------------------------------"
	return
    fi

    ok=()
    for tx in $(ls -d trip_point_?_type)
      do
      tn=${tx:11:1}
      ok[$tn]=1
    done
    if [[ -f trip_point_10_type ]]
	then
	for tx in $(ls -d trip_point_??_type)
	  do
	  tn=${tx:11:2}
	  ok[$tn]=1
	done
    fi

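    # Remember the highest trip point index found by the scans above;
    # show_cd_therm_entry overwrites tn while the cdev rows are printed.
    tmax=$tn

    echo "cdev#	trip_point_#	temp mC	type   	state	name	"
    echo "-----	------------	-------	-------	-------	------------------"
    for cd in $(ls -d cdev?)
      do show_cd_therm_entry $cd
    done
    if [[ -d cdev10 ]]
	then
	for cd in $(ls -d cdev??)
	  do
	  show_cd_therm_entry $cd
	done
    fi

    # List the trip points that are not bound to any cooling device,
    # from the highest index down to 0.
    tn=$tmax
    while [[ $tn -ge 0 ]]
      do
      if [[ ${ok[$tn]} -ne 0 ]]
	  then
	  show_therm_entry $tn
      fi
      tn=$(($tn-1))
    done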

    echo "------------------------------------------------------------------"
    cd $bw
}

# Show thermal_zone summary
show_thermal_zone_summary ()
{
    for zn in $(ls -d /sys/class/thermal/thermal_zone?)
      do
      show_therm_table $zn
    done
    if [[ -d /sys/class/thermal/thermal_zone10 ]]
	then
	for zn in $(ls -d /sys/class/thermal/thermal_zone??)
	  do
	  show_therm_table $zn
	done
    fi
}

show_cooling_device_entry ()
{
    ce=$1
    cn=${ce:19:26}
    echo -n "$cn 	"
    echo -n "$(cat $1/cur_state) / $(cat $1/max_state)	"
    echo "$(cat $1/type)	"
}

# Show cooling_device summary
show_cooling_device_summary ()
{
    echo "cooling_device# 	state	name	"
    echo "----------------	-----	------------------------------------"
    for zn in $(ls -d /sys/class/thermal/cooling_device?)
      do
      show_cooling_device_entry $zn
    done
    if [[ -d /sys/class/thermal/cooling_device10 ]]
	then
	for zn in $(ls -d /sys/class/thermal/cooling_device??)
	  do
	  show_cooling_device_entry $zn
	done
    fi
}

echo "------------------------------------------------------------------"

show_thermal_zone_summary
show_cooling_device_summary
ls /proc/device-tree/i2c@c250000/temp-sensor@4c/
cat /proc/device-tree/i2c@c250000/temp-sensor@4c/name
cat /proc/device-tree/i2c@c250000/temp-sensor@4c/compatible
ls /sys/devices/c250000.i2c/i2c-7/7-004c
cat /sys/devices/c250000.i2c/i2c-7/7-004c/temperature_overheat
cat /sys/devices/c250000.i2c/i2c-7/7-004c/temperature

Hi dthompson,

Could you provide more information?
Or has the issue been clarified and resolved?

Thanks

I just returned from the holidays. I will get back to you shortly with the data requested. I have functionally resolved the issue by lowering the trip points in the DTS, but I still have no root cause for what is actually resetting the board at high temperature.