Eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0

zschutschke · December 8, 2020, 12:21pm

While streaming (4k + pointcloud) video from the Jetson Xavier NX I repeatedly ran into networking issues. Other threads on that issue exist, e.g.

but they do not show any real solutions. I tried fiddling with queue sizes, real-time kernel process priorities etc., but nothing helped. The network hardware (eth0) stops working after issuing this error. A reboot or ip down/ip up helps, but is not really a solution for our scenario. Our setup:

docker container with argus/egl pipeline to gRPC → docker container gRPC traefik proxy → external client
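As a stopgap only, the ip down/ip up workaround mentioned above could be automated. The following is a minimal sketch, assuming the script runs as root, that `dmesg --follow` is available on the system, and that the interface is `eth0`; the function names and the cooldown value are hypothetical. It does not fix the underlying driver problem.

```python
import re
import subprocess
import time

# Pattern taken from the kernel message quoted in the thread title.
TX_RING_FULL = re.compile(r"eqos_start_xmit\(\): TX ring full for queue \d+")


def is_tx_ring_full(line: str) -> bool:
    """Return True if a kernel log line reports the eqos TX ring full error."""
    return TX_RING_FULL.search(line) is not None


def bounce_interface(iface: str = "eth0", pause_s: float = 1.0) -> None:
    """Take the interface down and bring it up again (needs root)."""
    subprocess.run(["ip", "link", "set", iface, "down"], check=True)
    time.sleep(pause_s)
    subprocess.run(["ip", "link", "set", iface, "up"], check=True)


def watch(iface: str = "eth0", cooldown_s: float = 30.0) -> None:
    """Follow the kernel log and bounce the interface when the error shows up."""
    last_bounce = float("-inf")
    with subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    ) as proc:
        for line in proc.stdout:
            # The cooldown avoids bouncing repeatedly on a burst of messages.
            if is_tx_ring_full(line) and time.monotonic() - last_bounce > cooldown_s:
                bounce_interface(iface)
                last_bounce = time.monotonic()
```

Calling `watch()` starts the loop; every bounce still drops active connections, so this only shortens the outage.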

The issue seems to be rooted in the eqos kernel module, where some condition seems to “kill” the kernel module in case of an unexpected “return NETDEV_TX_BUSY”.
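For context: Linux network drivers generally return NETDEV_TX_BUSY from their transmit hook when no TX descriptors are free, and they are expected to stop the queue and wake it again once completed descriptors are reclaimed. The toy model below illustrates that general pattern only. It is not the eqos code, and the class and method names are hypothetical.

```python
class ToyTxRing:
    """Toy model of a TX descriptor ring (hypothetical, not the eqos driver)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_flight = 0     # descriptors currently handed to the hardware
        self.stopped = False   # mirrors a stopped netdev TX queue

    def start_xmit(self, descs_needed: int) -> str:
        """Queue one packet, or report busy when the ring lacks room."""
        if self.size - self.in_flight < descs_needed:
            self.stopped = True        # analogous to netif_stop_queue()
            return "NETDEV_TX_BUSY"
        self.in_flight += descs_needed
        return "NETDEV_TX_OK"

    def tx_complete(self, descs_done: int) -> None:
        """Hardware finished some descriptors: free them and wake the queue."""
        self.in_flight -= min(descs_done, self.in_flight)
        if self.stopped and self.in_flight < self.size:
            self.stopped = False       # analogous to netif_wake_queue()


ring = ToyTxRing(size=4)
results = [ring.start_xmit(2) for _ in range(3)]  # third packet does not fit
stopped_when_full = ring.stopped
ring.tx_complete(2)                               # completion frees two slots
```

In this model the queue only recovers when `tx_complete()` runs. If completions never arrive, the queue stays stopped, which would resemble the hang described here; whether that is what happens in eqos is an assumption.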

I have now increased TX_BUF_SIZE and set EQOS_MAX_DATA_PER_TX_BUF to 8KB (commenting out the active “for testing purpose” line with 4KB) in nvidia/eqos/yheader.h. This seems to help, but I am not sure how stable it is. Any thoughts on that?
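A rough way to see why the larger per-buffer limit might relieve ring pressure, assuming EQOS_MAX_DATA_PER_TX_BUF caps the number of payload bytes a single TX descriptor can carry (an assumption about the constant, not confirmed in this thread). The 64 KB payload is a hypothetical large send, and `descriptors_needed` is an illustrative helper.

```python
import math

KB = 1024


def descriptors_needed(payload_bytes: int, max_data_per_buf: int) -> int:
    """Descriptors used by one payload if each holds at most max_data_per_buf bytes."""
    return max(1, math.ceil(payload_bytes / max_data_per_buf))


payload = 64 * KB  # hypothetical large send
for limit in (4 * KB, 8 * KB):
    print(f"{limit // KB} KB per buffer -> {descriptors_needed(payload, limit)} descriptors")
```

Under that assumption, doubling the limit halves the descriptors consumed per large send (16 versus 8 here), so a ring of the same size fills more slowly. It would not rule out the ring filling under a heavier load.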

WayneWWW · December 9, 2020, 2:58am

Hi,
Could you share your application with us so that we can reproduce your issue locally?

And can you at least share the full dmesg log with us?

zschutschke · December 15, 2020, 6:07pm

Hi Wayne,

today I found a few minutes to create a test app setup:
(1) network must be gigabit (with less it won’t crash the eqos)
(2) you’ll need traefik and a python3 container (see below)

I didn’t manage to reproduce the issue without traefik in between, but I am positive that the issue is not 100% related to traefik, as others have observed it in other setups. Probably traefik is doing “something” to the network packets which increases the chances of breaking stuff.

First of all: the dmesg (79.2 KB)

I’d need some more time to prepare an easy-to-launch environment, but here you’ll find the python3 server/client which flood the ring buffer. I run the server on the Jetson behind traefik and the client on my laptop (single server, single client). The traefik environment comes from dockerhub (traefik:v2.3) and is configured as below. I think this should be easy to set up on your side?

version: "3.5"
services:
  proxy:
    image: "traefik:v2.3"
    container_name: "traefik"
    restart: always
    command:
      #- "--log.level=DEBUG"
      #- "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.grpc.address=:8888"
    networks:
      - mynet
    ports:
      - "80:80"
      - "8888:8888"
      #- "8080:8080" # dashboard
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"

networks:
  mynet:
    external: true

For the sample server, you’ll need some additional labels in your docker-compose. You could use the 3dvl/jupyter docker image, as it has the python dependencies installed, or use any other python3 image and install numpy and grpcio-tools manually.

         labels:
            - "traefik.enable=true"
            - "traefik.http.routers.stereo.rule=Headers(`app`, `stereo`) && Headers(`content-type`, `application/grpc`)"
            - "traefik.http.routers.stereo.entryPoints=grpc"
            - "traefik.http.services.stereo.loadBalancer.server.port=51346"
            - "traefik.http.services.stereo.loadBalancer.server.scheme=h2c"

Hope that helps? If you need further assistance to reproduce, please let me know.

Best,

Axel

WayneWWW · December 16, 2020, 3:11am

Hello,

We don’t have experience with the traefik container and docker. Could you

(1) try to reproduce this issue on local L4T without any container?
(2) If (1) is not feasible, please provide a step-by-step setup method for us, in case we do anything wrong.
