Eqos dying quietly after TX ring buffer full message, lose all comms over ethernet

Hi,

We are encountering an issue with our Jetson Xavier NX on a custom carrier board using a kernel version based on L4T 32.4.3.

We are running an application that streams video from 3 cameras over a network using RTSP. Recently we changed the RTSP handling slightly: our application now connects to a Docker container running rtsp-simple-server (GitHub - bluenviron/mediamtx: Ready-to-use SRT / WebRTC / RTSP / RTMP / LL-HLS media server and media proxy that allows to read, publish, proxy and record video and audio streams.), and our client application connects over the network to that container (previously there was no Docker container, so the client connected directly to our application). What we are seeing now is that after a couple of hours of streaming, several kernel messages appear about the TX ring buffer queue being full, and some time after that all communication over the ethernet interface is lost. The machine can neither ping nor be pinged, although ifconfig still shows an IP address. I can also see in the output of ‘ethtool -S eth0’ that the tx and rx packet counters are no longer increasing. I can recover the eth0 interface by bringing it down and up again with ifconfig.
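The stall check and recovery described above can be scripted into a crude watchdog while the underlying problem is open. A minimal sketch, assuming the interface is eth0 and that the driver's TX packet counter names match the pattern `tx.*pkt` (both are assumptions; verify against your own `ethtool -S eth0` output and adjust):

```shell
#!/bin/sh
# Crude stall watchdog (sketch, not production code).
# Assumptions: interface eth0, and TX packet counters in `ethtool -S`
# whose names match "tx.*pkt" -- adjust the pattern for your driver.
IFACE=${IFACE:-eth0}

# Sum all TX packet counters from "name: value" lines on stdin.
sum_tx_pkts() {
    awk '/tx.*pkt/ { sum += $2 } END { print sum + 0 }'
}

watchdog() {
    a=$(ethtool -S "$IFACE" | sum_tx_pkts)
    sleep 10
    b=$(ethtool -S "$IFACE" | sum_tx_pkts)
    # Counters frozen across the interval -> assume the interface is dead.
    if [ "$a" -eq "$b" ]; then
        echo "TX counters stalled at $a packets, bouncing $IFACE"
        ifconfig "$IFACE" down && ifconfig "$IFACE" up
    fi
}

# Only act when invoked as `watchdog.sh run` (needs root).
if [ "${1:-}" = "run" ]; then
    watchdog
fi
```

Run it periodically from cron or a systemd timer with root privileges; it is only a band-aid, not a fix for the driver issue.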

I’m not sure if any of this information is useful, but: I have tried changing the size of the TX buffer using ethtool, but it seems the eqos driver doesn’t implement an interface for doing this. I can also see that the eqos driver has 8 buffers but only one is ever used (index 0). We have also tried running rtsp-simple-server outside of the Docker container and see the same results.

Even if the TX buffer is genuinely full, I don’t expect this to completely kill all connection over the ethernet interface, so what’s going on? Any ideas on how to further troubleshoot this?

Related issue:

My 5-cents on that:

  • seems indeed somehow related, in our case the connection was completely dead as well.
  • we “fixed” our problem by increasing the eqos TX ring-buffer size in the kernel (we build custom images, so that was not an issue) → the ring-buffer size is hard-coded in the driver source
  • we had similar problems in the past with network streams: if the consumer is too slow, or the network itself is bottlenecking the transmission, a networking queue builds up → maybe slowing down the producer might help?
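On that last point, one quick way to see whether a queue is building up is to watch the kernel's socket send queues on the server side. A sketch, assuming mediamtx is serving RTSP on its default port 8554 (adjust the port to your setup); a Send-Q that keeps growing means data is backing up because the client or network cannot keep pace:

```shell
# List established TCP connections on the RTSP port with their queue sizes.
# Watch the Send-Q column over time; steady growth = consumer too slow.
ss -tn state established '( sport = :8554 )'
```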

Best,

Axel


Hi Axel,

Thanks very much for the suggestions! I will have a look at each one.

Could you confirm which variable you changed in the eqos kernel source? Was it in drv.c, and what value did you end up using?

We also came across a post suggesting turning TSO off, which we are currently testing. Did you try that? Network interface disabled when massive data transferring - #15 by askariz0503
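For anyone following along, the TSO test amounts to something like this (interface name assumed to be eth0; needs root). Disabling TSO moves TCP segmentation back into software, which costs some CPU but takes the offload path in the driver out of the picture:

```shell
# Disable TCP segmentation offload on the interface.
ethtool -K eth0 tso off

# Confirm the setting took effect.
ethtool -k eth0 | grep tcp-segmentation-offload
```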

Cheers,
Sam

Hi Sam,

we did not test the TSO thing.

As for the buffer-size variable: it has been a while, and I checked in the whole file in my git patch, so this is only 90% certain:

eqos/yheader.h
#define TX_BUF_SIZE 24576 // original value: 1536

Best,

Axel
