Eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0

While streaming video (4K + point cloud) from the Jetson Xavier NX, I repeatedly ran into networking issues. Other threads on this issue exist, e.g.

but they do not offer any real solutions. I tried adjusting queue sizes, real-time kernel process priorities, etc., but nothing helps. The network hardware (eth0) stops working once this error has been logged. A reboot or ip down/ip up brings it back, but that is not really a solution for our scenario. Our setup:

docker container with argus/egl pipeline to gRPC → docker container gRPC traefik proxy → external client

The issue seems to be rooted in the eqos kernel module, where some condition seems to “kill” the kernel module in case of an unexpected “return NETDEV_TX_BUSY”.
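As a stopgap until the root cause is fixed, a watchdog could at least detect the failure from the kernel log. The following is only a minimal sketch: the regular expression is modeled on the exact message in the thread title (other driver versions may word it differently), and the function name is hypothetical.

```python
import re

# Pattern modeled on the message quoted in the thread title.
# Assumption: the driver always logs the interface name right before the function name.
TX_RING_FULL = re.compile(
    r"(?P<iface>\S+): eqos_start_xmit\(\): TX ring full for queue (?P<queue>\d+)"
)


def find_tx_ring_full(log_lines):
    """Return an (interface, queue) tuple for every 'TX ring full' line found."""
    hits = []
    for line in log_lines:
        match = TX_RING_FULL.search(line)
        if match:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits
```

The lines could come from the output of dmesg, for example; what to do on a hit (such as the ip down/ip up workaround) is left to the caller.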

I have now increased TX_BUF_SIZE and set EQOS_MAX_DATA_PER_TX_BUF to 8 KB (commenting out the active “for testing purpose” line with 4 KB) in nvidia/eqos/yheader.h. This seems to help, but I am not sure how stable it is. Any thoughts on that?
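One possible reason the larger value helps can be sketched with simple arithmetic. This assumes (not verified against the driver source) that the driver splits each payload into chunks of at most EQOS_MAX_DATA_PER_TX_BUF bytes and that every chunk occupies one TX ring slot. The function name and the example payload sizes are illustrative only.

```python
def buffers_needed(payload_bytes, max_data_per_buf):
    """Ceiling division: number of buffers needed to hold payload_bytes."""
    if payload_bytes <= 0:
        return 0
    return -(-payload_bytes // max_data_per_buf)


# Compare ring-slot usage for the 4 KB and 8 KB settings.
# The payload sizes below are arbitrary examples, not measured values.
for size in (1500, 9000, 65536):
    print(size, buffers_needed(size, 4096), buffers_needed(size, 8192))
```

Under that assumption, large segmented sends would consume about half as many ring slots with the 8 KB setting, so the ring would fill up less quickly. It would not explain why the interface fails to recover once the ring is full.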

Hi,
Could you share your application with us so that we can reproduce your issue locally?

And can you at least share the full dmesg log with us?

Hi Wayne,

Today I found a few minutes to create a test app setup:
(1) the network must be gigabit (with anything slower it won’t crash the eqos)
(2) you’ll need traefik and a python3 container (see below)

I didn’t manage to reproduce the issue without traefik in between, but I am confident that the issue is not entirely specific to traefik, as others have observed it in other setups. Probably traefik is doing “something” to the network packets that increases the chances of breaking things.

First of all: the dmesg (79.2 KB)

I’d need some more time to prepare an easy-to-launch environment, but here you’ll find the python3 server/client that flood the ring buffer (I run the server on the Jetson behind traefik and the client on my laptop; single server, single client). The traefik environment comes from Docker Hub (traefik:v2.3) and is configured as below. I think this should be easy to set up on your side?

version: "3.5"
services:
  proxy:
    image: "traefik:v2.3"
    container_name: "traefik"
    restart: always
    command:
      #- "--log.level=DEBUG"
      #- "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.grpc.address=:8888"
    networks:
      - mynet
    ports:
      - "80:80"
      - "8888:8888"
      #- "8080:8080" # dashboard
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    
networks:
  mynet:
    external: true

For the sample server, you’ll need some additional labels in your docker-compose file. You could use the 3dvl/jupyter docker image, as it has the python dependencies installed, or use any other python3 image and install numpy and grpcio-tools manually.

         labels:
            - "traefik.enable=true"
            - "traefik.http.routers.stereo.rule=Headers(`app`, `stereo`) && Headers(`content-type`, `application/grpc`)"
            - "traefik.http.routers.stereo.entryPoints=grpc"
            - "traefik.http.services.stereo.loadBalancer.server.port=51346"
            - "traefik.http.services.stereo.loadBalancer.server.scheme=h2c"
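To make explicit what the router rule in these labels expects from a request, here is a small sketch that mimics the match in plain Python. It assumes Traefik compares header values exactly, as its v2 documentation for Headers describes; the function name is hypothetical and this is not Traefik's own code.

```python
def matches_stereo_route(headers):
    """Mimic the rule Headers(`app`, `stereo`) && Headers(`content-type`, `application/grpc`)."""
    # HTTP header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return (
        normalized.get("app") == "stereo"
        and normalized.get("content-type") == "application/grpc"
    )
```

In other words, the client has to send an extra app header with the value stereo on every call. With grpcio this would typically be passed as call metadata, e.g. metadata=(("app", "stereo"),).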

Hope that helps? If you need further assistance to reproduce, please let me know.

Best,

Axel

Hello,

We don’t have experience with the traefik container or docker. Could you

  1. try to reproduce this issue on local L4T without any container.

  2. If (1) is not feasible, please provide a step-by-step setup procedure for us, in case we do anything wrong.