While streaming (4k + pointcloud) video from the Jetson Xavier NX I repeatedly ran into networking issues. Other threads on this issue exist, e.g.
but they do not show any real solutions. I tried fiddling with queue sizes, real-time kernel process priorities etc., but nothing helps. The network hardware (eth0) stops working after issuing this error. A reboot or ip down/ip up helps, but is not really a solution for our scenario. Our setup:
docker container with argus/egl pipeline to gRPC → docker container gRPC traefik proxy → external client
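For reference, the ip down/ip up workaround mentioned above can be scripted roughly like this. It is only a sketch: it assumes the iproute2 `ip` tool is installed, that the interface is `eth0`, and that the script runs as root. The function names are made up for illustration.

```python
import subprocess


def bounce_commands(iface="eth0"):
    """Return the two iproute2 commands that take the link down and up again."""
    return [
        ["ip", "link", "set", "dev", iface, "down"],
        ["ip", "link", "set", "dev", iface, "up"],
    ]


def bounce_interface(iface="eth0", dry_run=True):
    """Print the commands (dry_run=True) or actually run them (needs root)."""
    for cmd in bounce_commands(iface):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if `ip` reports a failure
            subprocess.run(cmd, check=True)
```

Calling `bounce_interface("eth0", dry_run=False)` as root resets the link. It only recovers the interface after the fact and does nothing about the cause.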
The issue seems to be rooted in the eqos kernel module, where some condition seems to “kill” the kernel module in case of an unexpected “return NETDEV_TX_BUSY”.
I have now increased TX_BUF_SIZE and set EQOS_MAX_DATA_PER_TX_BUF to 8KB (commenting out the active “for testing purpose” line with 4KB) in nvidia/eqos/yheader.h. This seems to help, but I am not sure how stable it is. Any thoughts on that?
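A rough back-of-the-envelope sketch of why the larger per-buffer limit might help. It assumes (not verified against the driver source) that one outgoing buffer is split into chunks of at most EQOS_MAX_DATA_PER_TX_BUF bytes and that each chunk occupies one TX ring slot. The 64 KiB payload is a hypothetical large segmented send, not a measured value.

```python
def tx_buffers_needed(payload_bytes, max_data_per_buf):
    """Ring slots needed for one payload, assuming one slot per chunk."""
    # Ceiling division without floating point
    return -(-payload_bytes // max_data_per_buf)


if __name__ == "__main__":
    payload = 64 * 1024  # hypothetical large send, in bytes
    for limit in (4 * 1024, 8 * 1024):
        print(f"{limit} bytes per buffer: {tx_buffers_needed(payload, limit)} slots")
```

Under these assumptions the 8KB setting halves the number of ring slots a large send consumes, so the ring would fill up later. That would make the NETDEV_TX_BUSY path less likely to be hit, but it would not fix whatever goes wrong once it is hit.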
Today I found a few minutes to create a test app setup:
(1) the network must be gigabit (on a slower link it won’t crash the eqos)
(2) you’ll need traefik and a python3 container (see below)
I didn’t manage to reproduce the issue without traefik in between, but I am positive that the issue is not 100% related to traefik, as others observed it in other setups - probably traefik is doing “something” to the network packets which increases the chances of breaking stuff.
I’d need some more time to prepare an easy-to-launch environment, but here you’ll find the python3 server/client which flood the ring buffer (I run the server on the jetson behind traefik and the client on my laptop, single server, single client). The traefik environment comes from dockerhub (traefik:v2.3) and is configured like below. I think this should be easy to set up on your side?
For the sample server, you’ll need some additional labels in your docker-compose. You could use the 3dvl/jupyter docker image, as it has the python dependencies installed - or use any other python3 image and install numpy and grpcio-tools manually.