Bridge network errors using docker-ce in swarm mode

I am attempting to use a TX1 as worker within a docker swarm.

The swarm also contains two x86_64 nodes - the master and another worker - both running stock latest revisions of Ubuntu 18.04 and docker-ce 18.09.3.

The TX1 is in a standard dev board, is running a somewhat stripped down install of L4T from Jetpack 3.3, and with docker-ce 18.09.3.

Using docker on the TX1 standalone works fine, for simple things at least: I can start and stop containers, connect to them, and so on.
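For example, something as simple as pulling an image and getting an interactive shell in it works directly on the board (just an illustrative smoke test, not an exhaustive check):

$ sudo docker run --rm -it ros:melodic-ros-core /bin/bash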

When I then try to add the TX1 to the cluster, after the usual set of info-level messages about the gossip cluster getting wired up, I see two error messages in "journalctl -u docker.service":

Mar 19 10:06:47 tegra-ubuntu dockerd[1028]: time="2019-03-19T10:06:47Z" level=error msg="enabling default vlan on bridge br0 failed open /sys/class/net/br0/bridge/default_pvid: permission denied"
Mar 19 10:06:47 tegra-ubuntu dockerd[1028]: time="2019-03-19T10:06:47.662790794Z" level=error msg="reexec to set bridge default vlan failed exit status 1"

… which I don’t see on my other worker.

The TX1 appears to have joined the cluster successfully from the manager:

$ docker node ls
ID                            HOSTNAME           STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
s14o76ap2sdgf2g7jfyka8b5h     geoff-OldMacBook   Ready    Active                          18.09.3
1k74jna5lhfba50ks6g7k0r7e     tegra-ubuntu       Ready    Active                          18.09.3
94zgws7ym9dwbo2x6be67hp9u *   toc17-office       Ready    Active         Leader           18.09.3

When I try to deploy a stack which puts a simple container onto the TX1 based on the following compose file fragment:

pub:
  image: ros:melodic-ros-core
  environment:
    - "ROS_MASTER_URI=http://ros-master:11311"
    - "ROS_HOSTNAME=pub"
  command: stdbuf -o L rostopic pub /turtle1/cmd_vel geometry_msgs/Twist -r 1 -- '[2.0, 0.0, 0.0]' '[0.0, 0.0, -1.8]'
  deploy:
    placement:
      constraints: [node.hostname == tegra-ubuntu]

It seems to get stuck in a continual fail-restart loop. If I deploy it to the other worker instead, it works fine. I can run that container image directly on the TX1 outside the swarm via docker run.
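For reference, I am deploying and watching the task along these lines (the stack name and compose file name here are just placeholders):

$ docker stack deploy -c docker-compose.yml ros_demo
$ docker service ps --no-trunc ros_demo_pub

The second command is where the repeated fail/restart of the task shows up.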

I am guessing that something in the L4T setup is causing the docker swarm overlay network not to be created correctly? Has anybody come across this before?

Thanks,

Geoff

For completeness, I have tried again with the stock Ubuntu 16.04-derived docker that comes with L4T (1.13.1), and that has the same problem :-<!

Geoff

Having dug a little further into it, it does seem to be a network setup issue in the vicinity of the TX1.

If I create a quiescent container on each worker using the following compose fragment:

pub:
  image: ros:melodic-ros-core
  environment:
    - "ROS_MASTER_URI=http://ros-master:11311"
    - "ROS_HOSTNAME=pub"
  command: sleep 999999
  deploy:
    placement:
      constraints: [node.hostname == tegra-ubuntu]

pub2:
  image: ros:melodic-ros-core
  environment:
    - "ROS_MASTER_URI=http://ros-master:11311"
    - "ROS_HOSTNAME=pub2"
  command: sleep 999999
  deploy:
    placement:
      constraints: [node.hostname == geoff-OldMacBook]

… and then shell into each by docker exec'ing /bin/bash (the exact commands are sketched after the list below), I find that:

  • ros-master resolves to 10.0.22.8 on both workers
  • "ping ros-master" works correctly on both workers
  • "curl http://ros-master:11311" succeeds on the other worker but gets connection refused on the TX1
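The checks above were done roughly like this inside each container (the container ID is a placeholder; ping and curl may need an apt-get install in the ros-core image):

$ docker exec -it <container-id> /bin/bash
root@<container-id>:/# getent hosts ros-master
root@<container-id>:/# ping -c 3 ros-master
root@<container-id>:/# curl http://ros-master:11311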

Intriguing!

Geoff

PS. ros-master is another container in the swarm, running on the manager node.

Focusing on the network setup on the two workers, there does seem to be an extra bridge interface on the working one compared to the TX1.

Working Ubuntu 18.04 worker:

$ ip -o a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
2: enp0s10 inet 192.168.0.3/24 brd 192.168.0.255 scope global noprefixroute enp0s10\ valid_lft forever preferred_lft forever
2: enp0s10 inet6 fe80::a673:3634:ff93:a8c2/64 scope link noprefixroute \ valid_lft forever preferred_lft forever
5: wls3 inet 192.168.218.129/24 brd 192.168.218.255 scope global dynamic noprefixroute wls3\ valid_lft 60406sec preferred_lft 60406sec
5: wls3 inet6 fe80::f51c:2d63:6ef1:b748/64 scope link noprefixroute \ valid_lft forever preferred_lft forever
6: docker0 inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\ valid_lft forever preferred_lft forever
7: br-26255e7ca904 inet 172.18.0.1/16 brd 172.18.255.255 scope global br-26255e7ca904\ valid_lft forever preferred_lft forever
8: docker_gwbridge inet 172.19.0.1/16 brd 172.19.255.255 scope global docker_gwbridge\ valid_lft forever preferred_lft forever
8: docker_gwbridge inet6 fe80::42:22ff:febb:c01f/64 scope link \ valid_lft forever preferred_lft forever
33: veth1fe0a5f inet6 fe80::e064:92ff:fed5:46d5/64 scope link \ valid_lft forever preferred_lft forever
46: enp0s4f1u1 inet6 fe80::2ec1:94bd:6108:6f78/64 scope link noprefixroute \ valid_lft forever preferred_lft forever
47: enp0s4f1u1i5 inet6 fe80::db42:3d8a:4c38:3cf2/64 scope link noprefixroute \ valid_lft forever preferred_lft forever
61: vethf8cf197 inet6 fe80::a01b:32ff:fe66:57d7/64 scope link \ valid_lft forever preferred_lft forever

TX1 L4T worker:

$ ip -o a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
6: usb0 inet6 fe80::f886:67ff:fe3d:bb53/64 scope link \ valid_lft forever preferred_lft forever
7: usb1 inet6 fe80::ac3a:37ff:fed1:eb8/64 scope link \ valid_lft forever preferred_lft forever
8: l4tbr0 inet 192.168.55.1/24 brd 192.168.55.255 scope global l4tbr0\ valid_lft forever preferred_lft forever
8: l4tbr0 inet6 fe80::fc5c:7dff:fea1:7948/64 scope link \ valid_lft forever preferred_lft forever
11: eth1 inet 192.168.0.2/24 brd 192.168.0.255 scope global eth1\ valid_lft forever preferred_lft forever
11: eth1 inet6 fe80::204:4bff:fe5a:b45d/64 scope link \ valid_lft forever preferred_lft forever
12: docker0 inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\ valid_lft forever preferred_lft forever
12: docker0 inet6 fe80::42:23ff:feda:569d/64 scope link \ valid_lft forever preferred_lft forever
13: docker_gwbridge inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge\ valid_lft forever preferred_lft forever
13: docker_gwbridge inet6 fe80::42:aff:fe51:e49d/64 scope link \ valid_lft forever preferred_lft forever
29: veth1f61ce0 inet6 fe80::e881:baff:fe04:20b1/64 scope link \ valid_lft forever preferred_lft forever
247: veth53367a5 inet6 fe80::3c01:e8ff:fef6:eabe/64 scope link \ valid_lft forever preferred_lft forever
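It might also be worth comparing docker's own view of the networks on the two workers, e.g. (the overlay network only appears on a worker while a task is running there, and its name depends on the stack name):

$ docker network ls
$ docker network inspect docker_gwbridge
$ docker network inspect <stack-name>_default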

Geoff

Digging deeper into the running containers I notice that the one running on the TX1 has a set of extra network devices compared with containers running on either the other worker or the manager node:

root@cc1d3ae129ad:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: tunl0@NONE: mtu 1480 qdisc noop state DOWN group default qlen 1
link/ipip 0.0.0.0 brd 0.0.0.0
3: sit0@NONE: mtu 1480 qdisc noop state DOWN group default qlen 1
link/sit 0.0.0.0 brd 0.0.0.0
4: ip6tnl0@NONE: mtu 1452 qdisc noop state DOWN group default qlen 1
link/tunnel6 :: brd ::
244: eth0@if245: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 02:42:0a:00:16:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.22.3/24 brd 10.0.22.255 scope global eth0
valid_lft forever preferred_lft forever
246: eth1@if247: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
valid_lft forever preferred_lft forever

They are all NOARP and DOWN so they shouldn't be doing anything, but their presence is unexpected. Here is the list from the other worker to compare:

root@270468515fa2:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
58: eth0@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 02:42:0a:00:16:06 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.22.6/24 brd 10.0.22.255 scope global eth0
valid_lft forever preferred_lft forever
60: eth1@if61: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:13:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet 172.19.0.3/16 brd 172.19.255.255 scope global eth1
valid_lft forever preferred_lft forever
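As an aside, tunl0/sit0/ip6tnl0 are normally what you get when the kernel's ipip, sit and ip6_tunnel modules are loaded (or built in) on the host, so this may just reflect a difference between the two hosts' kernels rather than anything docker is doing. A quick way to compare (guesswork on my part):

$ lsmod | grep -E '^(ipip|sit|ip6_tunnel)'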

Geoff

Hi,

I found a similar issue on docker's GitHub here:
https://github.com/docker/for-linux/issues/89

Could you check whether you are hitting the same issue?
Thanks.

That ticket mentions the same error string, but the topic there is actually a typo in the error string in some docker versions, not why the error occurs, unfortunately.

I guess the interesting question is why docker is trying to manipulate bridge br0 when there is no device of that name defined.
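(That is based on simply looking for it on the TX1, e.g.:

$ ip link show br0
$ ls /sys/class/net/

... and nothing called br0 shows up in either.)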

Geoff

I switched the module on the board for a TX2 and installed L4T R32.1, and the problem doesn't occur with that setup, so I guess the problem is something to do with the network setup in the Jetpack 3.3 L4T image.

When can we expect an update to Jetpack 3.3 to fix this or the TX1 to be supported by Jetpack 4.x?

Thanks,

Geoff

Hi, Geoff

Sorry that we cannot disclose our future plans here, but it should take at least a few months.

I'm not sure if this is a possible workaround, but it looks like you can manually set up a bridge for docker:
https://docs.docker.com/v17.09/engine/userguide/networking/default_network/build-bridges/

$ sudo ip link add name bridge0 type bridge
$ sudo ip addr add 192.168.5.1/24 dev bridge0
$ sudo ip link set dev bridge0 up
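After creating the bridge you would also need to point dockerd at it, for example by adding this to /etc/docker/daemon.json and restarting docker (example only, assuming the bridge is named bridge0 as above):

{
  "bridge": "bridge0"
}

$ sudo systemctl restart docker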

Could you try it and see if it helps?
Thanks.

It turns out I spoke too soon - it was still broken in Jetpack 4.2. I actually went a different way and am running the swarm in host networking mode to avoid the overlay network altogether.
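In case it is useful to anyone else, the gist of the host-networking approach is to attach the services to the pre-defined host network instead of an overlay. A rough sketch (compose file format 3.5 or later is needed for the external network name; service details trimmed):

version: "3.5"

services:
  pub:
    image: ros:melodic-ros-core
    command: sleep 999999
    networks:
      - hostnet
    deploy:
      placement:
        constraints: [node.hostname == tegra-ubuntu]

networks:
  hostnet:
    external: true
    name: host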

Geoff