R28.4 passive cooling device kills CPU/GPU sensor

gavin.lofts · January 18, 2021, 4:49pm

The trip points for the passive cooling devices for MCPU and GPU (/sys/class/thermal/thermal_zone1/trip_point_1_temp, /sys/class/thermal/thermal_zone2/trip_point_6_temp) seem to stop the respective sensors from working on L4T R28.4. I don’t see this problem on L4T R32.x or L4T R28.2.

These are the steps I take:

Install Jetpack 3.3.3 on the TX2 dev kit
Make the tensorrt samples: cd /usr/src/tensorrt/samples; sudo make
Run the googlenet sample in a loop to load the GPU: while true; do /usr/src/tensorrt/bin/sample_googlenet; done
Force the fan to 0: sudo bash -c "echo 0 > /sys/kernel/debug/tegra_fan/target_pwm"
Change the MCPU passive trip: sudo bash -c "echo 50000 > /sys/class/thermal/thermal_zone1/trip_point_1_temp"
Read the MCPU sensor: while true; do cat /sys/devices/virtual/thermal/thermal_zone1/temp; sleep 5; done
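
The trip-point and polling steps above can be wrapped in two small helpers. This is only a sketch and not the script attached later in the thread: the function names and the THERMAL_ROOT variable are made up for illustration, and the sysfs layout is assumed to match the paths quoted in the steps.

```shell
# Illustrative sketch: THERMAL_ROOT, set_trip and poll_temp are hypothetical names.
THERMAL_ROOT="${THERMAL_ROOT:-/sys/class/thermal}"

# Write a trip temperature (in millidegrees Celsius) to one trip point of a zone.
set_trip() {
    local zone="$1" trip="$2" millideg="$3"
    echo "$millideg" > "$THERMAL_ROOT/thermal_zone$zone/trip_point_${trip}_temp"
}

# Print the zone temperature <count> times, sleeping <interval> seconds between reads.
poll_temp() {
    local zone="$1" count="$2" interval="$3" i=0
    while [ "$i" -lt "$count" ]; do
        echo "zone $zone reading: $(cat "$THERMAL_ROOT/thermal_zone$zone/temp"): $(date)"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

On the device this would be run as root, for example `set_trip 1 1 50000` followed by `poll_temp 1 12 5`. If the sensor read hangs as described, `poll_temp` will block on that read.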

When the temperature crosses the modified trip point, any reads to the MCPU sensor don’t return. The passive cooling device also stops modifying the clock speeds as it should.

This is a problem for me as my real application is GPU intensive and even with full fan, in a hot environment, I rely on the cpu-balanced and gpu-balanced cooling devices to cool the TX2 device.

I’m not sure where to start with debugging the bpmp. I just load the binary supplied in L4T. Could someone help me, please?

WayneWWW · January 19, 2021, 3:07am

Thanks for reporting this. Let us check it.

WayneWWW · January 19, 2021, 6:35am

Hi,

We tried to reproduce your issue with r28.4 but could not.

There are some differences between your setup and ours.
We set trip_point_1_temp to 36C only to make the issue easier to reproduce, and we tried to heat the device to 38C.
The temperature read from thermal zone 1 is always OK when it reaches 38C.

Could you help check why we cannot reproduce the issue? Do you have any other configuration that we don’t know about?
What about the temperatures from the other sensors?
What is their status on your side?

gavin.lofts · January 19, 2021, 8:48am

Things seem a bit different between my D00 revision module and the D02:

For D02, as soon as the temperature goes through the trip point, the sensor no longer works.

For D00, it works correctly when the temperature first exceeds the trip point, but if I use the fan to cool the module below the trip point and then turn the fan off again, things break.

I haven’t tried on any older modules.

I have made an effort to remove all of my proprietary stuff from this, so I just have Jetpack installed and the instructions above.

The other sensors continue to work, so for example if the thermal_zone1 sensor stops working after the experiment above, the other on-die sensors (e.g. thermal_sensor2) and the TX2 board sensor continue to work.
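
A quick way to compare sensors is to list every thermal zone with its type and current temperature. The helper below is a sketch with a hypothetical name (`list_zones`); it assumes each zone directory exposes the standard `type` and `temp` files.

```shell
# Illustrative sketch: list_zones is a hypothetical helper name.
# Print "<zone> <type> <temp>" for every thermal zone under the given root.
list_zones() {
    local root="$1" dir
    for dir in "$root"/thermal_zone*; do
        # Skip anything that does not look like a readable zone.
        [ -r "$dir/temp" ] || continue
        echo "$(basename "$dir") $(cat "$dir/type") $(cat "$dir/temp")"
    done
}
```

On the device the call would be `list_zones /sys/class/thermal`. Because a broken sensor blocks its reader, the loop will stop at the affected zone rather than skip it.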

Thanks for helping me out!

WayneWWW · January 19, 2021, 8:58am

Hello,

Does the dmesg give any output when the sensor no longer works?

Also, could you describe the D00 case you mentioned again? I am not quite sure about the exact steps.

And could you confirm whether my setup (setting the trip point to 36C) reproduces the issue on your side too? If so, I need to find a D00 or D02 module for it.

gavin.lofts · January 19, 2021, 9:30am

Hello Wayne,

I just remembered one more thing about the configuration. I change the power model: sudo nvpmodel -m 0.

I don’t see anything in dmesg when the failure occurs.

I can see the problem when setting the trip point to 36°C.

For D00, this is the series of temperatures that I see leading to the failure:
33000
33000
33000
33000
35000
36500 **Trip point exceeded so cpu-balanced cooling device is activated
37500
37500
35000 **Below trip point so expect cpu-balanced to start turning clocks up
34500
35000
**Readings are broken now.

After failure, I also tried monitoring the state of the cooling device: cat /sys/class/thermal/thermal_zone1/cdev0/cur_state. I see that the state is stuck at 0.
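
To see the temperature, the trip point and the cooling state side by side, a one-line status helper can be useful. This is a sketch under assumptions: `zone_status` is a made-up name, and the zone directory is assumed to contain `temp`, `trip_point_N_temp` and `cdev0/cur_state` as in the paths quoted in this thread.

```shell
# Illustrative sketch: zone_status is a hypothetical helper name.
# Print temperature, trip point and cdev0 state for one zone directory,
# and say whether the temperature is at or above the trip point.
zone_status() {
    local zone_dir="$1" trip="$2" temp limit state
    temp="$(cat "$zone_dir/temp")"
    limit="$(cat "$zone_dir/trip_point_${trip}_temp")"
    state="$(cat "$zone_dir/cdev0/cur_state")"
    if [ "$temp" -ge "$limit" ]; then
        echo "temp=$temp trip=$limit cur_state=$state above-trip"
    else
        echo "temp=$temp trip=$limit cur_state=$state below-trip"
    fi
}
```

For the MCPU zone this would be called as `zone_status /sys/class/thermal/thermal_zone1 1`. Once the sensor has failed, the `temp` read inside the helper will hang, so it is mainly useful for logging the run-up to the failure.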

Thanks,
Gavin

WayneWWW · January 20, 2021, 3:40am

Hi,

There is something I would like to ask.

You are trying a D02 module, but this kind of module must be flashed with rel-32.4.3 / rel-28.4 (or later).

How did you test this module with rel-28.2? This is related to Jetson TX2 PCN 206440 DRAMeMMC.
https://developer.nvidia.com/embedded/downloads#?search=pcn

WayneWWW · January 20, 2021, 7:14am

We tested it with a D02 module too and still failed to reproduce the issue…

Are you using the NV devkit?

gavin.lofts · January 20, 2021, 7:17am

Hello WayneWWW,

I tested with D00 and D02 modules. For D00, I can use r28.2 and r32.4. You’re right, for D02, it won’t boot with r28.2.

I am using an NVIDIA TX2 dev kit that came with a D00 module fitted.

Thanks,
Gavin

WayneWWW · January 20, 2021, 7:24am

Hi,

Do you have other D02 modules that you can test? Also, is there any delay before this behavior appears on D02?
We only have D02 and some older C0x modules, so we may need an exact description for D02.

For example, does the “fail to return value” happen immediately once the actual temperature exceeds the trip point temperature, or do we have to wait for a while?

And is this issue very easy to reproduce? Does the temperature reading return to normal after a while?

gavin.lofts · January 20, 2021, 8:29am

Hello Wayne,

I will try to get some more D02 modules. I only have 2x D00 and 2x D02 in my lab at the moment.

I will send you a script later today just in case the time between steps is important.

I found that after the problem happened, the temperature sensor would not recover until I rebooted the dev kit. All reads would hang forever.
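
Since a failed sensor blocks the reader indefinitely, a monitoring loop may want to bound each read. The sketch below uses coreutils `timeout`; `read_with_timeout` is a hypothetical name. It assumes the blocked read can be interrupted by a signal, which may not hold if the process is stuck in uninterruptible kernel sleep.

```shell
# Illustrative sketch: read_with_timeout is a hypothetical helper name.
# Read a file, giving up after <secs> seconds.
# Prints the value on success, "TIMEOUT" if the read was cut off,
# or "ERROR" for any other failure (for example a missing file).
read_with_timeout() {
    local file="$1" secs="$2" value status
    value="$(timeout "$secs" cat "$file" 2>/dev/null)"
    status=$?
    case "$status" in
        0)   echo "$value" ;;
        124) echo "TIMEOUT"; return 1 ;;   # timeout(1) exits with 124 when the limit is hit
        *)   echo "ERROR"; return 1 ;;
    esac
}
```

A call such as `read_with_timeout /sys/class/thermal/thermal_zone1/temp 2` would then report `TIMEOUT` instead of freezing the shell, provided the assumption above holds.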

WayneWWW · January 20, 2021, 8:31am

Ok. Waiting for your script.

gavin.lofts · January 20, 2021, 10:46am

therm_prob.sh (1.7 KB)

I put this script in the home directory of the nvidia user and then execute it. On my machine, I see temperature readings every 5 seconds until the sensor breaks. Here’s what I see when I run it on my D00 module:

nvidia@minit-dk5:~$ ./therm_prob.sh 
running loop
/usr/src/tensorrt/bin ~
test running
Running googlenet
nvidia@minit-dk5:~$  MCPU reading: 33000: Date Wed Jan 20 10:38:41 UTC 2021

 MCPU reading: 34000: Date Wed Jan 20 10:38:46 UTC 2021

 MCPU reading: 33500: Date Wed Jan 20 10:38:51 UTC 2021

 MCPU reading: 33500: Date Wed Jan 20 10:38:56 UTC 2021

 MCPU reading: 33500: Date Wed Jan 20 10:39:01 UTC 2021

 MCPU reading: 33500: Date Wed Jan 20 10:39:07 UTC 2021

 MCPU reading: 33000: Date Wed Jan 20 10:39:12 UTC 2021

 MCPU reading: 33500: Date Wed Jan 20 10:39:17 UTC 2021

 MCPU reading: 33500: Date Wed Jan 20 10:39:22 UTC 2021

 MCPU reading: 33500: Date Wed Jan 20 10:39:27 UTC 2021

 MCPU reading: 33000: Date Wed Jan 20 10:39:32 UTC 2021

 MCPU reading: 35000: Date Wed Jan 20 10:39:37 UTC 2021

 MCPU reading: 37000: Date Wed Jan 20 10:39:42 UTC 2021

 MCPU reading: 37500: Date Wed Jan 20 10:39:47 UTC 2021

 MCPU reading: 37500: Date Wed Jan 20 10:39:52 UTC 2021

Running googlenet
 MCPU reading: 36500: Date Wed Jan 20 10:39:57 UTC 2021

 MCPU reading: 35500: Date Wed Jan 20 10:40:02 UTC 2021

 MCPU reading: 35500: Date Wed Jan 20 10:40:07 UTC 2021

 MCPU reading: 35000: Date Wed Jan 20 10:40:12 UTC 2021

Running googlenet
Running googlenet

WayneWWW · January 20, 2021, 11:09am

Hi @gavin.lofts,

Please also try this script on your side with D02 modules.

We will also try it on our device tomorrow.

gavin.lofts · January 20, 2021, 11:22am

Thank you. I will try this script on a D02 module tomorrow afternoon (UK time).

WayneWWW · January 21, 2021, 3:58am

Hi,

Please try applying this upstream patch to the r28.4 source.
We had a similar case months ago, and rel-32 is not affected.

38f757b.diff.zip (1.7 KB)

gavin.lofts · January 21, 2021, 9:25am

Thank you Wayne. I am trying the patch now.

gavin.lofts · January 21, 2021, 10:11am

The patch has fixed the problem. Thank you!