Jetson TK1 r8169 NETDEV WATCHDOG Timeout solved!

Hi,

Sorry for my bad English. I hope this solution helps.

I installed the R21.2 BSP on my Jetson TK1. As for you, under heavy gigabit Ethernet traffic the Ethernet driver stopped working.

When I tested heavy network traffic with “iperf” (between the TK1 and another PC), the Ethernet driver on the TK1 ALWAYS cut out within seconds, with this kernel message:

NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
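
To check whether a board has hit this timeout, the kernel log can be searched for the message. A minimal sketch, assuming a POSIX shell; the function name has_watchdog_timeout is made up for illustration:

```shell
# Succeeds if the kernel log text on stdin contains a transmit-queue
# watchdog timeout like the one quoted above.
# (has_watchdog_timeout is a hypothetical helper name.)
has_watchdog_timeout() {
    grep -q 'NETDEV WATCHDOG: .*transmit queue [0-9][0-9]* timed out'
}

# Usage on the TK1:
#   dmesg | has_watchdog_timeout && echo "transmit queue timed out"
```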

Some solutions did not work for me:

  • Disable ACPI, or APIC and/or LAPIC
  • Disable PCI ASPM
  • Disable IRQROUTING
  • Disable etc… :):):)
    …in kernel parameters

Only these workarounds worked for me:

  • Using 100M network
  • Force the gigabit port down to 100M with “ethtool -s eth0 speed 100 duplex full autoneg on”
  • Use only single core (“nosmp” in kernel parameters)
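
The ethtool workaround can be applied and then verified by reading the negotiated speed back. A small sketch; link_speed is a hypothetical helper, and the "Speed:" line is what ethtool normally prints, so treat the format as an assumption on your system:

```shell
# Print the negotiated speed from `ethtool <iface>` output given on stdin.
# (link_speed is a hypothetical helper name.)
link_speed() {
    sed -n 's/^[[:space:]]*Speed:[[:space:]]*//p'
}

# Usage on the TK1 (the first command needs root):
#   ethtool -s eth0 speed 100 duplex full autoneg on
#   ethtool eth0 | link_speed
```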

But the workarounds above are not good. Why degrade the CPU or network performance??? You paid full price… :):):)

But I never give up… :):):) You will get a 100% solution with these steps:

  1. Install and flash Jetson TK1 BSP R21.2 from Nvidia.
    (After that, “uname -r” shows: 3.10.40-ged4f697)

  2. Download and unpack the 3.10.40 TK1 kernel source from Nvidia.

  3. Unpack the running kernel config into the kernel source folder as .config (config.gz → .config)

  4. In the kernel config (make menuconfig), add the local version string “-ged4f697”, and disable the built-in Realtek network drivers!!! These steps are very important!

  5. Compile the kernel, then copy arch/arm/boot/zImage from the kernel source tree to /boot. Do not reboot yet!

  6. Download and unpack the latest R8168 proprietary kernel driver from Realtek.
    http://goo.gl/neQe

  7. Change “CONFIG_ASPM=y” to “n” in the Realtek src/Makefile. Without this the solution still works, but you will always get PCIe reply timeout errors in the kernel log.

  8. Compile the driver

  9. Add “r8168” line to /etc/modules. Important!!

  10. Reboot.
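
Steps 4 and 7 are the easiest ones to get wrong, so they can be checked or scripted. A minimal sketch, not the exact commands used for this post: both function names are hypothetical, CONFIG_R8169 is assumed to be the kernel symbol for the built-in Realtek driver, and the sed pattern assumes the Makefile line looks like "CONFIG_ASPM = y" (with or without spaces).

```shell
# Both helpers are hypothetical names used only for this sketch.

# Step 4: succeed only if the .config given as $1 has the local version
# string set and the in-kernel r8169 driver disabled.
# Assumption: CONFIG_R8169 is the symbol for the built-in Realtek driver.
check_kernel_config() {
    grep -q '^CONFIG_LOCALVERSION="-ged4f697"$' "$1" || return 1
    if grep -q '^CONFIG_R8169=[ym]' "$1"; then
        return 1
    fi
    return 0
}

# Step 7: switch CONFIG_ASPM from y to n in the Realtek Makefile given as $1.
# The pattern tolerates optional spaces around "=". Uses GNU sed -i.
disable_aspm() {
    sed -i 's/^\(CONFIG_ASPM[[:space:]]*=[[:space:]]*\)y/\1n/' "$1"
}

# Usage:
#   check_kernel_config .config && echo "kernel config looks right"
#   disable_aspm src/Makefile
```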

Steps above tested with:
-Jetson TK1 BSP R21.2
-kernel: 3.10.40-ged4f697
-Realtek Linux driver: 8.039

Result:

[    9.838303] r8168 Gigabit Ethernet driver 8.039.00-NAPI loaded
[    9.869231] r8168: This product is covered by one or more of the following patents: US6,570,884, US6,115,776, and US6,327,625.
[    9.869275] r8168  Copyright (C) 2014  Realtek NIC software team <nicfae@realtek.com> 
[    9.869275]  This program comes with ABSOLUTELY NO WARRANTY; for details, please see <http://www.gnu.org/licenses/>. 
[    9.869275]  This is free software, and you are welcome to redistribute it under certain conditions; see <http://www.gnu.org/licenses/>.

When using the r8168 driver instead of the original r8169, “iperf” can drive the gigabit port at 930/960 Mbit/s in full duplex. I tested for a whole night without any issue. At this network load, iperf consumes about one CPU core.
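
The exact iperf commands are not given in this post. As a sketch of one way to run a comparable bidirectional test with iperf 2 and pull the bandwidth figures out of its report: the address 192.168.1.10, the 60-second duration and the helper name iperf_mbits are all assumptions for illustration.

```shell
# Print the bandwidth number from every iperf report line on stdin that
# is expressed in Mbits/sec. (iperf_mbits is a hypothetical helper name.)
iperf_mbits() {
    awk '{ for (i = 2; i <= NF; i++) if ($i == "Mbits/sec") print $(i - 1) }'
}

# Usage (iperf 2; -d runs both directions at the same time):
#   on the PC:   iperf -s
#   on the TK1:  iperf -c 192.168.1.10 -d -t 60 | iperf_mbits
```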

If you do not want to do the steps above, I made a precompiled package. Usage:

  1. Fresh install/flash Tegra TK1 with BSP R21.2
  2. Optional: temporarily force gigabit down to 100M
  3. Download my package: http://goo.gl/KELZPt or see the attachment.
  4. Unpack into TK1 root folder and run “depmod”.
cd /
tar -xzf <path>/Jetson_TK1_R21.2_linux-3.4.10-ged4f697_r8168.tar.gz
depmod

This will overwrite zImage and uimg, add the r8168.ko module, and replace /etc/modules.
  5. Reboot

poweroff

The package only replaces the built-in r8169 driver with the Realtek r8168 driver. There is no other modification or configuration change.

I hope this helps many TK1 developers.

Regards, and Merry Xmas… :):):)
Tibor Szolnoki

Jetson_TK1_R21.2_linux-3.4.10-ged4f697_r8168.tar.gz (11.9 MB)


Well done Tibor, seems like I was close enough :) Weird, but the latest driver didn't work for me in the past. For now I have switched back to 19.3 to get networking working, so I am stuck on CUDA 6.0 because of that. But if there is no release soon, I will give your solution a shot. Iperf is a great tool. What is your iperf result? Here is my network tuning script, in case it helps a bit:

#!/bin/bash
sysctl -w net.core.rmem_max=33554432
sysctl -w net.core.wmem_max=33554432
sysctl -w net.core.rmem_default=33554432
sysctl -w net.core.wmem_default=33554432

sysctl -w net.ipv4.tcp_no_metrics_save=1

sysctl -w net.ipv4.tcp_congestion_control=cubic

sysctl -w net.ipv4.route.flush=1

ifconfig eth0 txqueuelen 5000

echo 10000000 > /proc/sys/fs/file-max

echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout

echo 30 > /proc/sys/net/ipv4/tcp_keepalive_intvl

echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes

echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

ifconfig eth0 mtu 9000

ifconfig

Hello Tibor,
Thanks a lot for your solution!

I had trouble compiling the driver because the /usr/src/linux-headers-3.10.40-ged4f697 dir contains some scripts built for 64 bits. I had to make a symlink named /usr/src/linux-headers-3.10.40-ged4f697 pointing to the kernel source directory, and then everything worked fine.
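
One way to make such a symlink, as a sketch only: the helper name link_headers, the kernel source path in the usage line, and the choice to keep the old directory under a ".orig" name are assumptions, not part of the original steps.

```shell
# Point the headers path $2 at the kernel source tree $1.
# An existing real directory at $2 is kept aside as "$2.orig"
# (an assumption of this sketch) instead of being deleted.
# (link_headers is a hypothetical helper name.)
link_headers() {
    src=$1
    dst=$2
    [ -d "$src" ] || return 1
    if [ -e "$dst" ] && [ ! -L "$dst" ]; then
        mv "$dst" "$dst.orig" || return 1
    fi
    ln -sfn "$src" "$dst"
}

# Usage (as root; the source path is only an example):
#   link_headers /home/ubuntu/kernel /usr/src/linux-headers-3.10.40-ged4f697
```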

I took a look at your package and found the Linux kernel image (/boot/zImage), the modified /etc/modules, and the driver for the Realtek network device (/lib/modules/3.10.40-ged4f697/kernel/drivers/net/r8168.ko). The same files I had modified :)

But what is the /boot/vmlinux.uimg file for? I have read here (https://devtalk.nvidia.com/default/topic/767123/embedded-systems/jetson-boot-vmlinux-uimg-/) that the bootloader doesn’t need it.

I have two Jetson TK1 boards updated with the kernel modification and the compiled Realtek driver as you explained, but I didn’t make any modification to the /boot/vmlinux.uimg file:

# ll /boot/zImage
-rwxr-xr-x 1 root root 5922136 Jan 16 11:13 /boot/zImage*
# ll /boot/vmlinux.uimg
-rw-r--r-- 1 root root 6139784 Dec 16 11:47 /boot/vmlinux.uimg
# sha1sum /boot/vmlinux.uimg
4b22befc82e703fe8861b1e725b251129f5d6a97  /boot/vmlinux.uimg

And they are booting without any problem…

I know very little about Linux kernel booting and boot loaders :S
Could you clarify that?

Thanks a lot in advance!
Best regards
Fernando

Hi again!

I have been testing my boards with the recompiled kernel and the r8168 driver, and they are hanging again under not very intensive network load (downloading an ISO image from the internet). Of course, they also hang with iperf and scp. But now I get no log messages in syslog…

Did you connect to the Jetson through a switch, or with a direct cable from your desktop to the board? I am using a switch (I have tried different ones) to connect two Jetsons and am testing between them.

Could you share the iperf commands you used to get 930/960Mbit/s??

Thanks!
Fer

vmlinux.uimg is a kernel format which was once used by u-boot…and is no longer used. Current u-boot for all known generations of L4T on Jetson uses zImage instead. You can safely ignore all information suggesting use of uimg…if this exists in your u-boot setup on Jetson in the /boot directory you can remove vmlinux.uimg.

Thanks for the explanation, linuxdev!

Has anybody else used Tibor’s solution and continued to have trouble with the network card?

Hi,

vmlinux.uimg is a similar kernel format, used by U-Boot.
Generally, Tegra uses only the zImage format.
I attached it only for reference, in case anybody wants to use U-Boot.

Are you absolutely sure that the loaded kernel is the modified kernel, and that it uses the r8168 driver?
Try these:
dmesg | grep r8169
You should get (almost) nothing. If you see anything, you booted the original zImage, not mine.
If you boot the original zImage, the kernel's built-in r8169 is still active and will be used; r8168 will be “ignored”.

dmesg | grep r8168
Your result should be similar to:
r8168 Gigabit Ethernet driver 8.039.00-NAPI loaded
etc…
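
The two checks can be combined into a single test. A minimal sketch; which_realtek_driver is a hypothetical name, and it follows the rule above that any r8169 line in the log means the original kernel was booted:

```shell
# Read kernel log text on stdin and print which Realtek driver it shows:
# "r8169" if the built-in driver appears at all, otherwise "r8168" if the
# Realtek module appears, otherwise "none".
# (which_realtek_driver is a hypothetical helper name.)
which_realtek_driver() {
    log=$(cat)
    case $log in
        *r8169*) echo r8169 ;;
        *r8168*) echo r8168 ;;
        *)       echo none ;;
    esac
}

# Usage on the TK1:
#   dmesg | which_realtek_driver
```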

Thanks for your reply!!

Yes, I have checked that I am booting the modified kernel compiled 4 days ago and that the Realtek driver is r8168:

root@jetson1 $ uname -a
Linux mlxj1 3.10.40-ged4f697 #1 SMP PREEMPT Fri Jan 16 10:38:55 UTC 2015 armv7l armv7l armv7l GNU/Linux
root@jetson1 $ dmesg|grep r816
[   10.025247] r8168 Gigabit Ethernet driver 8.039.00-NAPI loaded
[   10.045599] r8168: This product is covered by one or more of the following patents: US6,570,884, US6,115,776, and US6,327,625.
[   10.045656] r8168  Copyright (C) 2014  Realtek NIC software team <nicfae@realtek.com> 
[   15.108727] r8168: eth0: link up

Who knows, maybe there is another bug to be discovered, but when I execute:

jetson2:~$ scp -r NVIDIA_CUDA-6.5_Samples_jetson2/ jetson1:.

In less than 1 minute the jetson1 board hangs (actually the network hangs, but the boards are connected to a switch and I can only access them by SSH). I have also tested with different switches.

Thanks in advance for your help!

I had the same problem and needed to migrate to mainline kernel 3.18.2 to get the network running at 1Gb. I’m using the TK1 as a headless server, so I’m happy without some NVIDIA features…

Here you’ll find a nice explanation from linuxdev: https://devtalk.nvidia.com/default/topic/794758/embedded-systems/jetson-tk1-r8169-ethernet-cuts-on-high-load/post/4414338/#4414338