Drive PX2 rebooting at high CPU load

We have an issue where the Drive PX2 reboots under high CPU load.
It occurs only on TegraA, not on TegraB.

For example, it occurs with the following steps.

  • step 1: Execute the command "stress -c 4".
  • step 2: Click desktop icon "DriveNet(file)".

The log from when the reboot occurred is below.
It is the output sent to the host PC from the debug port on the Drive PX2.

[  809.909543] nvgpu: 0000:04:00.0           __nvgpu_check_gpu_state:60   [ERR]  GPU has disappeared from bus!!
[  809.911297] nvgpu: 0000:04:00.0           __nvgpu_check_gpu_state:61   [ERR]  Rebooting system!!
Feb 17 21:59:11 tegra-ubuntu kernel: [  809.909543] nvgpu: 0000:04:00.0           __nvgpu_check_gpu_state:60   [ERR]  GPU has disappeared from bus!!
Feb 17 21:59:11 tegra-ubuntu kernel: [  809.911297] nvgpu: 0000:04:00.0           __nvgpu_check_gpu_state:61   [ERR]  Rebooting system!!
Feb 17 21:59:11 tegra-ubuntu kernel: [  810.160396] tegra-xusb 3530000.xhci: Host not halted after 16000 microseconds.
[  810.235001] tegradc 15220000.nvdisplay: can't set parent_clk_safe for sor->ref_clk
Feb 17 21:59:11 tegra-ubuntu kernel: [  810.235001] tegradc 15220000.nvdisplay: can't set parent_clk_safe for sor->ref_clk
Feb 17 21:59:11 tegra-ubuntu kernel: [  810.236982] tegra_nvdisp_handle_pd_disable: Powergated Head2 pd
Feb 17 21:59:11 tegra-ubuntu kernel: [  810.247341] sdhci-tegra 3460000.sdhci: Tuning done, restoring the best tap value : 10
Feb 17 21:59:11 tegra-ubuntu kernel: [  810.249351] therm-fan-est: shutting down
Feb 17 21:59:11 tegra-ubuntu kernel: [  810.249592] nvgpu: 18000000.vgpu                 gk20a_pm_shutdown:872  [INFO]  shutting down
[  810.323950] tegradc 15200000.nvdisplay: can't set parent_clk_safe for sor->ref_clk
[  810.327346] reboot: Restarting system
[    0.000000] Booting Linux on physical CPU 0x100
[    0.000000] Linux version 4.9.80-rt61-tegra (buildbrain@mobile-u64-3039) (gcc version

The software version is:

  • TegraA/B : DRIVEOS 5.0.10.3
  • Aurix : DRIVE-V5.0.10-P2379-EB-Aurix-With3LSS-4.02.04

We have another Drive PX2.
We tried the same steps on that unit, and the reboot did not happen.
Therefore, we think this issue is unique to this one unit.

Does anyone know what causes this error and how to fix it?

Thanks.

Dear atsutaka,

Thank you for posting this topic.
May I know the reproduction rate of this problem? Thanks.

Dear SteveNV,

Thank you for your reply.
It occurred in 10 out of 10 tests (100%).

Thanks.

Dear atsutaka,

Thank you for your update.
I want to ask you one more thing.
Is it possible to get the complete dmesg log? Thanks.

Dear SteveNV,

The kern.log is attached.
The reboot occurred between lines 15225 and 15226 of kern.log.

15225 Feb 20 00:11:29 tegra-ubuntu kernel: [  122.626242] eqos ioctl: HW PTP not running
15226 Feb 20 00:13:45 tegra-ubuntu kernel: [    0.000000] Booting Linux on physical CPU 0x100
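
For a large kern.log, the reboot boundary can also be located automatically, because the kernel uptime in square brackets drops back toward zero after a restart. The following is only a minimal sketch under that assumption; the function name `find_reboots` and the regular expression are illustrative and not part of any NVIDIA tooling.

```python
import re

# Kernel uptime as it appears in syslog-style lines,
# e.g. "kernel: [  122.626242] eqos ioctl: HW PTP not running"
KERNEL_TS = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def find_reboots(lines):
    """Return (line_number, last_message_before_reboot) pairs.

    A reboot is assumed wherever the kernel uptime goes backwards
    between two consecutive kernel messages.
    """
    reboots = []
    prev_ts = None
    prev_line = None
    for number, line in enumerate(lines, start=1):
        match = KERNEL_TS.search(line)
        if match is None:
            continue  # not a kernel message, skip it
        ts = float(match.group(1))
        if prev_ts is not None and ts < prev_ts:
            reboots.append((number, prev_line.rstrip("\n")))
        prev_ts = ts
        prev_line = line
    return reboots
```

Passing an open file object for kern.log as `lines` returns the 1-based line number of the first message after each reboot, together with the last kernel message logged before it.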

Please tell us if there is any information we should check.

Thanks.
kern.log (1.61 MB)

Dear atsutaka,

Thank you for your update.
We will look into this topic and update. Thanks.

Dear SteveNV,

Any update?

Dear atsutaka,

We are still debugging this issue and need more time. Sorry for the inconvenience.

Dear atsutaka,

Sorry for the late update.

Could you please make sure that the reboots are not caused by thermal issues?
Please log the temperature regularly (maybe every 1 sec) while running the high CPU load test, and check what the last reported temperature value is.

$cat /sys/bus/i2c/devices//ext_temperature
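
Logging once per second could be scripted roughly as follows. This is a minimal sketch, not an official tool: `log_temperature` and its parameters are hypothetical names, and the sensor path must be supplied by the caller, since the exact device directory depends on the board. Each sample is flushed and synced to disk so that the last reading is more likely to survive a sudden reboot.

```python
import os
import time


def read_temperature(sensor_path):
    """Read the raw value from a sysfs temperature file."""
    with open(sensor_path) as sensor:
        return sensor.read().strip()


def log_temperature(sensor_path, log_path, interval_s=1.0, samples=None):
    """Append timestamped readings to log_path.

    sensor_path: path to the ext_temperature file (supplied by the caller).
    samples: number of readings to take, or None to run until interrupted.
    Returns the number of readings written.
    """
    count = 0
    with open(log_path, "a") as log:
        while samples is None or count < samples:
            value = read_temperature(sensor_path)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {value}\n")
            # Push the sample to disk before the board can reset.
            log.flush()
            os.fsync(log.fileno())
            count += 1
            if samples is None or count < samples:
                time.sleep(interval_s)
    return count
```

Starting the logger before running "stress -c 4" and reading the last line of the log after the reboot gives the last temperature reported before the reset.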

Dear SteveNV,

I have checked the temperature values at the time of the reboot.
The thermal log is attached.

Thanks.
px2_reboot_thermal.log (2.43 KB)

Hi atsutaka,

Is this still an issue on your side? Is there any information you can share?

Thanks

Hi kayccc,

Yes, this issue still remains.
I guess it is a HW-related issue.

Please tell me if there is any information we should check.

Thanks.

Dear atsutaka,

Sorry for the late update.
Since it occurs only under high load, could you please check whether the fan is spinning?
Could you also please re-flash the board, if possible, and re-test? Thanks.

Hi SteveNV,

The fan of TegraA’s dGPU was not working because of a foreign object.
After removing it, the fan spins and this issue has been resolved.

I’m sorry for overlooking this basic check…
Thank you for your support!

Dear atsutaka,

Thank you for your update.
We will close this bug ticket. Thanks.