Network crash after Wake-on-LAN with direct cable

Hi All,

I encountered the same issue as in this post [1], but we need to use a direct cable instead of going through a switch.
Any help will be appreciated.

We can reproduce the problem every time.


Setup:
Direct or crossover Ethernet cable from the Xavier NX to a PC.
Fixed IP on both sides.

Commands on the NVIDIA side:
sudo ethtool -s eth0 wol g
sudo systemctl suspend

Command on the PC:
sudo etherwake -i eno1 [MAC]
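
For context: `wol g` arms the interface to wake on a magic packet, and etherwake sends one. The payload is six 0xFF bytes followed by the target MAC address repeated 16 times. Below is a minimal sketch of that payload in C. The helper name `build_magic_payload` is hypothetical (it is not part of etherwake or the eqos driver), and the code only builds the bytes; it does not open a socket or send anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6
#define MAGIC_SYNC_LEN 6
#define MAGIC_REPEAT 16
/* 6 sync bytes + 16 copies of a 6-byte MAC = 102 bytes */
#define MAGIC_PAYLOAD_LEN (MAGIC_SYNC_LEN + MAGIC_REPEAT * MAC_LEN)

/* Hypothetical helper: fill `out` (at least MAGIC_PAYLOAD_LEN bytes) with the
 * magic-packet payload for `mac`. Returns the number of bytes written. */
size_t build_magic_payload(const uint8_t mac[MAC_LEN], uint8_t *out)
{
	size_t off;

	assert(mac != NULL && out != NULL);

	/* synchronization stream: six bytes of 0xFF */
	memset(out, 0xFF, MAGIC_SYNC_LEN);
	off = MAGIC_SYNC_LEN;

	/* target MAC address, repeated 16 times */
	for (int i = 0; i < MAGIC_REPEAT; i++) {
		memcpy(out + off, mac, MAC_LEN);
		off += MAC_LEN;
	}
	return off;
}
```

With a direct cable there is nothing between the two NICs, so the frame carrying this payload reaches the Xavier NX directly.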

The NVIDIA board wakes up, but the network lights go off and we can read this message in dmesg. Also, after 2 minutes the PC reboots.

[ 158.643764] eqos 2490000.ether_qos: WoL Failed to reset MAC
[ 158.644047] dpm_run_callback(): eqos_resume_noirq+0x0/0x1d0 returns -19
[ 158.644282] PM: Device 2490000.ether_qos failed to resume noirq: error -19
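
The -19 in the last two dmesg lines is a negated errno value; on Linux, 19 is ENODEV ("No such device"). A small user-space sketch to decode such a return code follows. `describe_kernel_ret` is a hypothetical helper name for illustration, not a kernel or driver function.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a kernel-style negative return code
 * (for example -19) together with its errno description.
 * Returns the number of characters snprintf would have written. */
int describe_kernel_ret(int ret, char *buf, size_t len)
{
	/* kernel callbacks return -errno, so flip the sign before lookup */
	int err = ret < 0 ? -ret : ret;

	return snprintf(buf, len, "%d: %s", ret, strerror(err));
}
```

Calling `describe_kernel_ret(-19, buf, sizeof buf)` on a Linux host fills `buf` with the code and the text for ENODEV.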

[1] Issue with Wake-on-LAN on Xavier NX - #9 by WayneWWW

Will the WoL work when you connect both devices to the switch?

Yes, it works.

Hi, I just tried this on my local setup, which is the same as what you described.

Our NX can still be woken up by the host, though. Does your NX fail to wake up 100% of the time? I mean, if you cold boot the device and try it 10 times, will it fail to wake up all 10 times?

It wakes up every time, but after the wake-up:

  1. The network stops working immediately (LAN lights go off, no answer over ssh)
  2. I can read the mentioned message in dmesg (using HDMI and a keyboard)
  3. The system reboots after a few minutes

Can you share the panic log? If it reboots, there is probably a kernel panic.

Thank you for your support.
I am attaching the kernel log and syslog of a crash. The reboot was at 12:35.
kernelLog.txt (379.3 KB)

syslog.txt (170.4 KB)

Hi,

Syslog does not record the kernel panic. Please capture the UART log instead.

We don’t have the UART interface yet. I will come back with the log when I get it.
Thank you.

This is the kernel panic information.
nvidiaWOLConsole.txt (51.9 KB)
[ 62.538111] cache: parent cpu1 should not be sleeping
[ 62.550725] cache: parent cpu2 should not be sleeping
[ 62.582729] cache: parent cpu3 should not be sleeping
[ 62.612976] cache: parent cpu4 should not be sleeping
[ 62.648782] cache: parent cpu5 should not be sleeping
[ 68.354126] eqos 2490000.ether_qos: WoL Failed to reset MAC
[ 68.354391] dpm_run_callback(): eqos_resume_noirq+0x0/0x1d8 returns -19
[ 68.354633] PM: Device 2490000.ether_qos failed to resume noirq: error -19
[ 68.397817] tegra_cec 3960000.tegra_cec: Resuming
[ 68.398052] tegra_cec 3960000.tegra_cec: tegra_cec_init started
[ 69.408235] tegra_cec 3960000.tegra_cec: tegra_cec_init Done.
[ 243.040291] INFO: task whoopsie:5649 blocked for more than 120 seconds.
[ 243.040473] Not tainted 4.9.201-tegra #1
[ 243.040556] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this.
[ 243.041003] Kernel panic - not syncing: hung_task: blocked tasks
[ 243.041125] CPU: 0 PID: 683 Comm: khungtaskd Not tainted 4.9.201-tegra #1
[ 243.041246] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[ 243.041356] Call trace:
[ 243.041417] [] dump_backtrace+0x0/0x198
[ 243.041544] [] show_stack+0x24/0x30
[ 243.041649] [] dump_stack+0xa0/0xc8
[ 243.041779] [] panic+0x12c/0x2a8
[ 243.041880] [] watchdog+0x320/0x3d0
[ 243.041976] [] kthread+0xec/0xf0
[ 243.042066] [] ret_from_fork+0x10/0x30
[ 243.042176] SMP: stopping secondary CPUs
[ 243.042462] Kernel Offset: disabled
[ 243.042719] Memory Limit: none
[ 243.042947] trusty-log panic notifier - trusty version Built: 08:40:57 Feb 1.

Hi jordimarine, I am hitting the same error as you. Is there any update on this topic?

We are working on a fix.

We are not using WoL because of this issue.
A fix would be great.

Hi WayneWWW, have you been able to reproduce it on your side yet, or do you have any more information about this issue?

static INT eqos_car_reset(struct eqos_prv_data *pdata)
{
	/* 500,000 polls x 10 us per poll: roughly a 5 second timeout */
	ULONG retry_cnt = (500 * 1000);
	ULONG vy_count = 0;
	ULONG dma_bmr;
	/* deassert rst line */
	if (!IS_ERR_OR_NULL(pdata->eqos_rst)) {
		dev_err(&pdata->pdev->dev, "deassert rst line ! null\n");
		reset_control_reset(pdata->eqos_rst);
	} else {
		dev_err(&pdata->pdev->dev, "deassert rst line was null\n");
	}
	/* add delay of 10 usec */
	udelay(10);

	while (vy_count < retry_cnt) {
		/* read the DMA bus mode register */
		DMA_BMR_RD(dma_bmr);
		/* success once the SWR field reads back as 0 */
		if (GET_VALUE(dma_bmr,
			DMA_BMR_SWR_LPOS, DMA_BMR_SWR_HPOS) == 0) {
			return Y_SUCCESS;
		}
		vy_count++;
		udelay(10);
	}
	return -Y_FAILURE;
}

static int eqos_resume_noirq(struct device *dev)
{
...
	if (device_may_wakeup(&ndev->dev)) {
		disable_irq_wake(pdata->phydev->irq);
		/* issue CAR reset to device */

		ret = hw_if->car_reset(pdata);
		if (ret < 0) {
			dev_err(&pdata->pdev->dev, "WoL Failed to reset MAC, try again\n");
			// return -ENODEV;  
			
			ret = hw_if->car_reset(pdata);
			if (ret < 0) {
				dev_err(&pdata->pdev->dev, "WoL Failed to reset MAC\n");
				return -ENODEV; // why not return timeout here?
			}
		}
		eqos_start_dev(pdata);
...

I think the error is returned by the function 'eqos_car_reset'; it looks like the PHY reset times out. I don’t know why it times out because I have no documentation about it, and I also don’t know whether the cause is software or hardware.
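
To make the timeout behaviour of the polling loop in eqos_car_reset easier to reason about, here is a minimal user-space model of it. This is only a sketch under assumptions: `poll_reset`, `busy_for_n_reads` and `always_busy` are hypothetical names, the "register" is simulated by a callback, and the waiting time is counted rather than actually slept. It does not touch hardware, and it leaves out the single udelay(10) that the driver issues before the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for reading the reset bit: returns true while the
 * simulated reset is still in progress. */
typedef bool (*reset_busy_fn)(void *ctx);

struct poll_result {
	bool done;           /* true if the bit cleared within the budget */
	uint64_t waited_us;  /* simulated time spent waiting, in microseconds */
};

/* Poll up to max_polls times, charging delay_us for every busy read. */
struct poll_result poll_reset(reset_busy_fn busy, void *ctx,
			      uint64_t max_polls, uint64_t delay_us)
{
	struct poll_result r = { false, 0 };

	assert(busy != NULL);
	for (uint64_t i = 0; i < max_polls; i++) {
		if (!busy(ctx)) {
			r.done = true;
			return r;
		}
		r.waited_us += delay_us; /* the driver calls udelay(10) here */
	}
	return r; /* budget exhausted: the driver returns -Y_FAILURE here */
}

/* Simulated device: busy for a fixed number of reads, then clears. */
bool busy_for_n_reads(void *ctx)
{
	uint64_t *remaining = ctx;

	if (*remaining == 0)
		return false;
	(*remaining)--;
	return true;
}

/* Simulated device whose reset bit never clears. */
bool always_busy(void *ctx)
{
	(void)ctx;
	return true;
}
```

With the values from the posted code (500 * 1000 polls, 10 us each), `poll_reset(always_busy, NULL, 500000, 10)` reports a wait of 5,000,000 us, so each failed car_reset call blocks for about 5 seconds and the retry in eqos_resume_noirq roughly doubles that. The model says nothing about why the bit stays set on real hardware.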