Jetson TK1 r8169 Ethernet cuts out under high load

Hi all,
I am having problems with my board because the network keeps cutting out. I recently bought a Gbit switch and set up a private network for the nodes. Everything works fine by default, but when I switch to jumbo frames and run the CPU at maximum, all nodes become unstable after a while and Ethernet goes down with nothing in the kernel log. (It mostly happens if I don’t make a call for a while, which makes me feel like something is going idle permanently.)

I found out that this is a common issue with the Realtek 8169 driver under high load:

https://bugzilla.kernel.org/show_bug.cgi?id=14962
or just google: https://www.google.cz/search?client=ubuntu&channel=fs&q=r8169+high+speed&ie=utf-8&oe=utf-8&gfe_rd=cr&ei=GuuFVN3wI4mh8wfu_IGIDg

I thought it shouldn’t be an issue, as the current kernel version is 3.10 (I am using L4T 21.1).

Then I found this comment:

https://enc.com.au/2013/10/16/damn-you-unworking-r8168/#comment-2380

I have created a conf file under
/etc/modprobe.d/damn-realtek-firmware.conf
with the content of:

# r8169 driver causes periodic disconnection, supposedly this fixes it

options r8169 use_dac=1 debug=8
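One caveat about the file above: modprobe.d options are only read when a driver is loaded as a module. A later post in this thread mentions that r8169 is built in in the 21.1 kernel, and for a built-in driver the same parameters have to go on the kernel command line in the form module.parameter=value. A minimal sketch of that conversion follows; the function name options_to_cmdline is made up for illustration and is not part of any tool.

```shell
# Convert one modprobe.d "options" line into kernel command line form.
# Hypothetical helper, shown only to illustrate the module.parameter=value syntax.
options_to_cmdline() {
    set -- $1                      # split the line into words
    [ "$1" = options ] || return 1 # only "options" lines make sense here
    module=$2
    shift 2
    out=
    for param in "$@"; do
        out="${out:+$out }${module}.${param}"
    done
    echo "$out"
}

# Example:
# options_to_cmdline "options r8169 use_dac=1 debug=8"
```

The resulting words would be appended to the kernel arguments used at boot. Whether the parameters themselves help is a separate question.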

(apt-get install firmware-realtek didn’t work because I couldn’t find the package in the ARM repo. Network Manager is disabled, there is no avahi daemon, and USB autosuspend is disabled.)

Some people also advise recompiling the driver from the Realtek website, but its autorun.sh didn’t work either, probably because of the device tree?

Any suggestions?

This is an issue with the Realtek driver in R21.1 (L4T R21.1 just got unlucky; it isn’t directly an L4T issue, and there is no issue at all in R19.3). See this thread:
https://devtalk.nvidia.com/default/topic/787949/?comment=4359675

The workaround is given by this specific reply in that thread:
https://devtalk.nvidia.com/default/topic/787949/embedded-systems/netdev-watchdog-eth0-r8169-transmit-queue-0-timed-out-with-tegra-r21/post/4359675/#4359675

Thanks @linuxdev. I saw that thread and its solution, but I was hoping for a better solution that doesn’t reduce the speed. Anyway, I will give it a shot and test the bandwidth.

OK, I did a bit more reading about this driver; maybe this will save someone time. I recompiled the 21.1 kernel with r8169 changed from built-in (*) to module (M), which let me try other drivers.
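To confirm how a given kernel was configured before and after such a rebuild, the config file can be checked for CONFIG_R8169. The following is a small sketch; the function name kconfig_state is invented for this example, and reading /proc/config.gz assumes the running kernel was built with CONFIG_IKCONFIG_PROC, which may not be the case.

```shell
# Print how a kernel option is set in a kernel config read from stdin:
# "builtin" (=y), "module" (=m), or "unset" (commented out or absent).
kconfig_state() {
    opt=$1
    awk -v opt="$opt" '
        $0 == opt "=y" { state = "builtin" }
        $0 == opt "=m" { state = "module" }
        END { print (state == "" ? "unset" : state) }
    '
}

# Examples:
# zcat /proc/config.gz | kconfig_state CONFIG_R8169
# kconfig_state CONFIG_R8169 < /path/to/kernel/.config
```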

First I tried the existing r8169 driver with the parameters below. The network still went down on the client after high load.

options r8169 use_dac=1 debug=8

I compiled and loaded r8168-8.039.00 from the Realtek website. eth0 worked, but the result under high load was the same; it just survived a bit longer than the original driver.

The r8169-6.0.014.00 driver from the Realtek website didn’t recognize eth0.

Disabling acpm in the U-Boot config didn’t help with any driver; Ethernet still cut out after high load.

pci_acpm=off

Each driver has different parameters, which you can check with

modinfo -p r816x

I tried different combinations of kernel parameters, which didn’t help either. Some people in particular had luck with ‘autoneg off’:

ethtool -s eth0 autoneg off

The mii-tool watch option didn’t produce any output while the network was down:

mii-tool -w eth0
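Since the failures happen while working remotely, one option is to log the link state to a local file and read it after the network comes back. This is a sketch only: link_summary is a made-up helper that condenses the usual `ethtool eth0` output into one line, and the interface name and log path in the commented loop are assumptions.

```shell
# Condense `ethtool <iface>` output (read from stdin) into a single line.
# Fields that ethtool does not report are left empty.
link_summary() {
    awk -F': *' '
        /^[ \t]*Speed:/            { speed = $2 }
        /^[ \t]*Duplex:/           { duplex = $2 }
        /^[ \t]*Auto-negotiation:/ { autoneg = $2 }
        /^[ \t]*Link detected:/    { link = $2 }
        END { printf "speed=%s duplex=%s autoneg=%s link=%s\n", speed, duplex, autoneg, link }
    '
}

# Example polling loop (commented out; eth0 and the log path are assumptions):
# while sleep 5; do
#     echo "$(date '+%F %T') $(ethtool eth0 | link_summary)"
# done >> /tmp/eth0-link.log
```

If the link still reads as detected at the moment traffic stops, that would point away from a physical link drop and toward the driver or its queues, though this alone does not prove it.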

There was no kernel log on any failure, even with the log level set to 7 in menuconfig.

I tried all of these, mostly remotely.

Some people also wrote that they got lucky after disabling the IOMMU in the kernel config/BIOS, but I gave up :) I can get 968 Mbit/s with jumbo frames, and when I set the MTU to 1500 (the standard Ethernet MTU) it’s not even 50 Mbit/s. I am working on MPI, and some parts of my code rely on the network, which is why I bought a gigabit switch :( Is anyone using their devices on a gigabit network remotely without any problems?
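For reference, frame overhead alone cannot explain a drop to 50 Mbit/s at MTU 1500. A rough best-case estimate, assuming IPv4 and TCP without options (40 bytes of headers inside the MTU) and 38 bytes of Ethernet framing per packet (14 header, 4 FCS, 8 preamble, 12 inter-frame gap), can be sketched like this. The function name tcp_goodput_mbit is invented for the example.

```shell
# Estimate the best-case TCP payload rate in Mbit/s for a link speed and MTU.
# Assumptions: IPv4 + TCP with no options, standard Ethernet framing overhead.
# $1 = link speed in Mbit/s, $2 = MTU in bytes
tcp_goodput_mbit() {
    awk -v link="$1" -v mtu="$2" \
        'BEGIN { printf "%.1f\n", link * (mtu - 40) / (mtu + 38) }'
}

# Examples:
# tcp_goodput_mbit 1000 1500   # about 949 Mbit/s
# tcp_goodput_mbit 1000 9000   # about 991 Mbit/s
```

Under these assumptions the two MTU settings differ by only a few percent on a healthy link, so a result below 50 Mbit/s at MTU 1500 suggests something other than framing overhead is limiting throughput.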

I recommend my gigabit ethernet solution:
https://devtalk.nvidia.com/default/topic/799075/embedded-systems/jetson-tk1-r8169-netdev-watchdog-timeout-solved-/

Same problem here. I recompiled 3.10.40 without r8169 and installed r8168, but the network still fails with moderate traffic (wget something…). Too bad :-(

Any ideas what to try next?

No idea on what to try, but after poking around my conclusion is that the r8169 driver is not the actual fault…at least not all by itself, so fixing the driver probably will not fix the issue. A wild guess is that the full story involves the scheduler interacting with the driver and that changing the scheduler would be the real fix. The older r8168 likely just interacts differently by good luck (luck is of course a great component…I keep trying to find it in object oriented design books, but they forgot to add this).

Figuring out how the timing and interaction works together would be an extraordinary feat. Somewhere a short distance into the future from 3.10.40 this interaction issue just disappeared.

:-(

I’m using my Jetson just as a headless server and would love to use a mainline kernel. I tried 3.18.1 with def_tegra config, but the kernel does not boot. In fact I see nothing via HDMI or on the network (no MAC seen on the network). :-(

Still waiting for my serial/usb adapter to arrive to see if I can get anything from uboot.

The mainline kernel is still missing some features, but it definitely should boot. Unfortunately, with an upstream kernel you also need to use upstream u-boot. Did you test with that?

Yes, but no luck. I compiled mainline u-boot and tried to install it with tegra-uboot-scripts. With this I can see the ethernetaddr and ip set via --env in the environment, so u-boot is working, but the kernel never boots after this. I also tried flash.sh -L with mainline u-boot, but with this I never saw anything on my network. Do I need to change anything in mainline u-boot for it to work with flash.sh? (I tried several configs for network, preboot and netconsole and applied a patch for PCIe to work, but nothing changed.)

Maybe some dumb mistake somewhere, but without a serial console I’m stuck with the default stack…

Trying different u-boots and kernels is something that one shouldn’t even try without a serial cable :)

So I recommend that you wait for the USB-serial cable to arrive and then test again.

There’s some good info about the generic Tegra boot process here:
http://http.download.nvidia.com/tegra-public-appnotes/tegra-boot-flow.html

And slightly outdated info about upstream kernel and u-boot (the links are hopefully still valid):
http://elinux.org/Tegra/Mainline_SW

Yeah, sometimes I’m a little bit masochistic :). Today my USB-serial cable arrived and it works perfectly. BUT: with mainline u-boot there is nothing on the serial console either…

I opened a new topic for this problem: https://devtalk.nvidia.com/default/topic/802953/embedded-systems/how-to-compile-amp-install-mainline-u-boot-/

I’ll come back to this thread if I manage to get mainline u-boot, the kernel and hopefully Ethernet to work… :)

Mainline kernel is running! I just forgot to change the name of the .dtb file in my extlinux.conf. Kernel 3.18.2 is working like a charm.
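A quick sanity check for this kind of mistake is to verify that every device tree file named in extlinux.conf actually exists. The sketch below assumes the config uses `FDT <path>` lines and that the paths are relative to the directory passed as the second argument; missing_fdt is a made-up name, and a stale entry is only caught this way if the old file is no longer present.

```shell
# Report device tree files named on "FDT" lines of an extlinux.conf
# that do not exist under the given root directory.
# $1 = path to extlinux.conf, $2 = directory the FDT paths are relative to
missing_fdt() {
    conf=$1
    root=$2
    awk 'toupper($1) == "FDT" { print $2 }' "$conf" |
    while read -r dtb; do
        [ -f "${root}/${dtb#/}" ] || echo "missing: $dtb"
    done
}

# Example (paths are assumptions):
# missing_fdt /boot/extlinux/extlinux.conf /
```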

And last but not least… the network is great! No more problems with the built-in r8169 driver. Problem solved for me. :-)

p.s. Docker works great on tk1. :-)

@coderchris, how did you install Docker on the TK1? Do I need to use the Grinch kernel? I tried sudo apt-get install docker.io, but running sudo docker -d gives me an error.

It’s hard to help if you only say that you see an error. Please always post the error message you see as well.

Here’s the output with error:
ubuntu@tegra-ubuntu:~$ sudo docker -d
[sudo] password for ubuntu:
2015/04/18 13:35:57 WARNING: The docker runtime currently only officially supports amd64 (not arm). THIS BUILD IS NOT OFFICIAL AND WILL NOT BE SUPPORTED BY DOCKER UPSTREAM.
2015/04/18 13:35:57 docker daemon: 1.0.1 990021a; execdriver: native; graphdriver:
[8ec66997] +job serveapi(unix:///var/run/docker.sock)
[8ec66997] +job initserver()
[8ec66997.initserver()] Creating server
2015/04/18 13:35:57 Listening for HTTP on unix (/var/run/docker.sock)
Error running DeviceCreate (createPool) dm_task_run failed
[8ec66997] -job initserver() = ERR (1)
2015/04/18 13:35:58 Error running DeviceCreate (createPool) dm_task_run failed

Docker requires 64-bit platforms:
https://docs.docker.com/installation/ubuntulinux/

Ubuntu on Jetson is 32-bit.

You can check your kernel config with

#on running system
lxc-checkconfig

#before compile
CONFIG=/path/to/kernel_config /usr/bin/lxc-checkconfig

cheers