[35.4.1] system hang after wake from sc7 on Xavier NX devkit

Hi,
We are using a Xavier NX devkit with the official image: JP512-xnx-sd-card-image.zip.
We found that the system hangs after a Wake-on-LAN wake-up.

  1. sudo ethtool -s eth0 wol g
  2. sudo systemctl suspend
  3. WolCmd.exe 48b02d078768 192.168.100.100 255.255.255.0
  4. The DUT wakes up; however, the system hangs and most commands no longer work, for example ifconfig, ethtool, cp, etc.
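
For reference, the magic packet that step 3 sends can be sketched in Python. This is a minimal sketch, not what WolCmd.exe does internally: it assumes the standard Wake-on-LAN layout (six 0xFF bytes followed by the target MAC repeated 16 times) sent as a UDP broadcast. The function names, UDP port 9, and the default broadcast address are illustrative assumptions.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a WoL magic packet: 6 bytes of 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 12 hex digits, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast (port 9 is an assumed default)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

With the values from step 3, a call such as `send_magic_packet("48b02d078768", broadcast="192.168.100.255")` should be roughly equivalent, where 192.168.100.255 is assumed to be the broadcast address for 192.168.100.100 with mask 255.255.255.0.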

Mar 15 23:08:28 asus-desktop kernel: [ 217.804296] tegra-pmc: Resume caused by WAKE20, 2490000.ethernet:01

Mar 15 23:08:28 asus-desktop kernel: [ 218.911130] nvethernet 2490000.ethernet: [poll_check][42][type:0x4][loga-0x0] poll_check: timeout
Mar 15 23:08:28 asus-desktop kernel: [ 218.911168] nvethernet 2490000.ethernet: failed to poll mac software reset
Mar 15 23:08:28 asus-desktop kernel: [ 218.911181] nvethernet 2490000.ethernet: failed to resume the MAC
Mar 15 23:08:28 asus-desktop kernel: [ 218.911222] PM: dpm_run_callback(): platform_pm_resume+0x0/0x80 returns -1
Mar 15 23:08:28 asus-desktop kernel: [ 218.911231] PM: Device 2490000.ethernet failed to resume: error -1
Mar 15 23:08:28 asus-desktop kernel: [ 218.919273] OOM killer enabled.
Mar 15 23:08:28 asus-desktop kernel: [ 218.919280] Restarting tasks …
kern.log (835.4 KB)
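
A quick way to pull these failures out of a large kern.log is a small filter like the one below. It is only a sketch: the regular expression is modeled on the `PM: Device ... failed to resume` line shown above, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Modeled on: "PM: Device 2490000.ethernet failed to resume: error -1"
RESUME_FAIL = re.compile(r"PM: Device (\S+) failed to resume: error (-?\d+)")


def find_resume_failures(log_text: str) -> List[Tuple[str, int]]:
    """Return (device, error_code) for every failed device resume in the log text."""
    return [(m.group(1), int(m.group(2))) for m in RESUME_FAIL.finditer(log_text)]
```

Reading the attached kern.log into a string and passing it to `find_resume_failures` would list each device that failed to resume, together with its error code.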

Just to clarify: does this issue happen every time, or is it intermittent?

If it is intermittent, what is the reproduction rate? That is, how many suspend cycles did it take to hit this issue?

Hi,
This issue happens every time.
We also found that it can be reproduced easily with the following steps:

  1. sudo ethtool -s eth0 wol g
  2. sudo systemctl suspend
  3. Press a key on the keyboard to wake the device up
  4. The system hangs and most commands no longer work, for example ifconfig, ethtool, cp, etc.

Hi fengying_chu,

We can’t reproduce the issue on r35.4.1 + Xavier-NX.
We tested WoL 10 times and it worked every time without any hang.

Steps:
[Xavier-NX]

$ sudo ethtool -s eth0 wol g
$ sudo systemctl suspend

[Host]

$ sudo apt-get install wakeonlan
$ wakeonlan -i [DUT-IP-Address] [DUT-HW-Address]

I noticed you said that you used “JP512-xnx-sd-card-image.zip”. Could you fully flash your board with sdkmanager and test this again? (The same applies to your other issue, where disconnecting HDMI affects iperf.)

Hi, I used the following to build and flash the image to Xavier NX, and the issue still occurs:

sudo tar xpf Jetson_Linux_R35.4.1_aarch64.tbz2
sudo tar xpf …/…/Tegra_Linux_Sample-Root-Filesystem_R35.4.1_aarch64.tbz2
sudo ./apply_binaries.sh
sudo ./tools/l4t_flash_prerequisites.sh
sudo BOARDID=3668 BOARDSKU=0001 FAB=100 ./tools/kernel_flash/l4t_initrd_flash.sh --no-flash --network usb0 --massflash 10 jetson-xavier-nx-devkit-emmc mmcblk0p1
sudo ./tools/kernel_flash/l4t_initrd_flash.sh --erase-all --flash-only --massflash 2 --showlogs jetson-xavier-nx-devkit-emmc mmcblk0p1

The kern.log shows:
Mar 15 23:20:18 asus-desktop kernel: [ 678.090082] nvethernet 2490000.ethernet: [poll_check][42][type:0x4][loga-0x0] poll_check: timeout
Mar 15 23:20:18 asus-desktop kernel: [ 678.090089] nvethernet 2490000.ethernet: failed to poll mac software reset
Mar 15 23:20:18 asus-desktop kernel: [ 678.090093] nvethernet 2490000.ethernet: failed to resume the MAC
Mar 15 23:20:18 asus-desktop kernel: [ 678.090127] PM: dpm_run_callback(): platform_pm_resume+0x0/0x80 returns -1
Mar 15 23:20:18 asus-desktop kernel: [ 678.090133] PM: Device 2490000.ethernet failed to resume: error -1

kern.log (286.3 KB)
log.7z (154.0 KB)

Please just let sdkmanager flash your board. Do not prepare any BSP or run any manual flash command by yourself.

Also, could you try to share the log from UART as this issue is related to suspend mode?

While capturing the UART log, run the steps 5 times and attach the result here.

Hi,
Is there any difference between sdkmanager and the manual commands?
I believe the manual commands I used are taken from the NVIDIA documentation.
If I used the wrong commands, please let me know. Thanks a lot.

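May I know why I have to use sdkmanager to reproduce the issue? Is there any difference between using sdkmanager and using NVIDIA's official manual build/flash method?
I followed NVIDIA's official website. If I used the wrong commands to reproduce the issue, please let me know.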

Hi,

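This is all just part of the debugging process, to reduce the chance of mistakes on your side.
For example, you could simply use flash.sh instead of initrd flash.

As another example, I am not sure whether your BOARDID=3668 BOARDSKU=0001 FAB=100 values are correct.
Using sdkmanager guarantees that none of these will be a problem.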

Hi,
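Then could you provide the commands, so that I can follow them and reproduce the issue with the manual build/flash method?

Flashing and reproducing the issue with sdkmanager does not really help us.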

Hi,

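This is not about whether it helps you right now. The first step is to clarify whether this issue actually exists.

For example, you said earlier that you are reproducing it on an NV devkit. So let's start with the simplest approach, which is reflashing with sdkmanager.

If the issue does not occur that way, we will go back and review where your steps went wrong.
If it also occurs with sdkmanager, then we can debug it on our side.

Also, one more reminder: please capture the log directly over UART. There is no need to additionally capture dmesg many times.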

Hi,
I used sdkmanager to flash the Xavier NX devkit. The issue still occurs.
It is easy to reproduce.

  1. sudo ethtool -s eth0 wol g
  2. sudo systemctl suspend
  3. Wake up with the keyboard
  4. The system hangs
    Repeat steps 1-3 each time, because WoL on eth0 is automatically set back to off after suspend.
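
To confirm that WoL really is reset after each suspend, the `Wake-on:` field from `ethtool eth0` can be checked before step 2 and again after resume. A minimal sketch, assuming the usual `ethtool` text output in which `Wake-on: g` means magic-packet wake is enabled and `Wake-on: d` means it is disabled; the helper names are hypothetical, and reading the field may require root.

```python
import re
import subprocess
from typing import Optional


def parse_wol_mode(ethtool_output: str) -> Optional[str]:
    """Extract the current Wake-on mode (e.g. 'g' or 'd') from `ethtool <iface>` output.

    The 'Supports Wake-on:' line is skipped because the pattern requires
    'Wake-on:' to be the first non-blank text on its line.
    """
    match = re.search(r"^\s*Wake-on:\s*(\S+)", ethtool_output, re.MULTILINE)
    return match.group(1) if match else None


def read_wol_mode(iface: str = "eth0") -> Optional[str]:
    """Run ethtool on the interface and return its Wake-on mode (may need root)."""
    result = subprocess.run(
        ["ethtool", iface], capture_output=True, text=True, check=True
    )
    return parse_wol_mode(result.stdout)
```

If `read_wol_mode()` returns `g` before suspend and `d` after resume, that matches the reset behavior described in the steps.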

The log shows:
Sep 5 11:21:00 ubuntu kernel: [ 231.329723] tegra-pmc: Resume caused by WAKE80, irq 202
Sep 5 11:21:00 ubuntu kernel: [ 232.431322] nvethernet 2490000.ethernet: [poll_check][42][type:0x4][loga-0x0] poll_check: timeout
Sep 5 11:21:00 ubuntu kernel: [ 232.431331] nvethernet 2490000.ethernet: failed to poll mac software reset
Sep 5 11:21:00 ubuntu kernel: [ 232.431336] nvethernet 2490000.ethernet: failed to resume the MAC
Sep 5 11:21:00 ubuntu kernel: [ 232.431374] PM: dpm_run_callback(): platform_pm_resume+0x0/0x80 returns -1
Sep 5 11:21:00 ubuntu kernel: [ 232.431380] PM: Device 2490000.ethernet failed to resume: error -1

syslog (848.6 KB)
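
To tell which wake source triggered each resume, the `tegra-pmc: Resume caused by ...` lines can be extracted from the log. The sketch below is modeled only on the two forms seen in this thread (`WAKE20, 2490000.ethernet:01` after the WoL wake and `WAKE80, irq 202` after the keyboard wake); the function name is hypothetical, and other kernel versions may format the line differently.

```python
import re
from typing import List, Tuple

# Modeled on: "tegra-pmc: Resume caused by WAKE80, irq 202"
WAKE_RE = re.compile(r"tegra-pmc: Resume caused by (WAKE\d+), (.+)$", re.MULTILINE)


def wake_sources(log_text: str) -> List[Tuple[str, str]]:
    """Return (wake_event, detail) for each resume recorded in the log text."""
    return [(m.group(1), m.group(2).strip()) for m in WAKE_RE.finditer(log_text)]
```

Comparing the wake event of each resume with the failures that follow it shows whether the nvethernet resume error depends on how the board was woken.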

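May I ask, is it that you don't know how to capture logs over UART?

Please capture the log this way, recording every operation you perform and all of the resulting output,

rather than waiting until the next boot after the error and pulling it from syslog.

Also, I would like to ask: is this test actually related to WoL? It looks to me like you are just doing an ordinary suspend/wake, simply using the keyboard to wake up.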

Hi,
As I said, it is easy to hit this issue on the devkit.

  1. sudo ethtool -s eth0 wol g
  2. sudo systemctl suspend
  3. Wake up with the keyboard
  4. The system hangs
    Repeat steps 1-3 each time, because WoL on eth0 is automatically set back to off after suspend.

Hi,

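OK, so could you please capture at least one UART log, so that we can check why the system hangs when it happens?
Capturing a UART log is very important for this kind of issue. In the future, please capture logs this way from the start, because syslog cannot show what was printed at the moment the system hung.

What we care about right now is your system hang. WoL being set back to off is expected behavior.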

Hi,
The uart log:
uart_log.txt (74.5 KB)

Hi @fengying_chu

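Thanks for sharing.
Could you tell us roughly how many test runs it actually takes to hit this issue?

From your earlier description, it sounds very easy to reproduce, but on our side we woke the board with WoL 10 times yesterday and never hit it.

It does not feel like the "always" reproducible situation you described.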