Network interface disabled when massive data transferring

Moving the discussion from DRIVE AGX section to Xavier:

https://devtalk.nvidia.com/default/topic/1051433/general/-pegasue-drive-os-v5-1-0-0-13431798-network-interface-disabled-when-massive-data-transferring-loc-/post/5397103/#5397103

Is there someone able to help with that issue?

Hi atanas,

Could you share your setup and how to reproduce this issue?

The previous thread you linked does not tell us how to reproduce the issue; it only contains some error logs.

Let me quote the important parts from the previous thread.

We get unpredictable freezing of the internal network adapter under heavy load, and the last messages in the kernel log are exactly the same as those the author of the other thread initially reported:

My setup and (not tested but likely to work) instructions to reproduce:

atanas,

Sorry for the late reply. Could you share some steps that directly reproduce this issue?
That would save time for both of us. If I try sending a massive amount of data but don’t hit this issue, I will have to ask you again anyway, so it is better to clarify first.

I think you will hit the issue if you try to download a few gigabytes through the Xavier’s built-in ethernet port, whatever the protocol is - http, ftp, ssh … I will try to confirm that right now.

Do you mean that if I put a video file, or something else such as a whole Jetson driver package tarball, on an ftp server and download it to the Tegra, I will see the ethernet driver error?

Yes … I’m trying to reproduce the issue by generating a random file from /dev/urandom, but I’m working remotely today and it will take a while … give me 10 minutes and I will get back to you.
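The random-file test mentioned above could be sketched roughly as follows. This is only an illustration: the function name, path and size are made up, and the transfer itself is left to whichever client (http, ftp, scp) is at hand.

```shell
# Create a payload of random bytes to serve from the Xavier.
# Usage: make_random_file PATH SIZE_MIB   (hypothetical helper)
make_random_file() {
    # Random data is incompressible, so transfer compression cannot shrink it.
    head -c "$(( $2 * 1048576 ))" /dev/urandom > "$1"
}

# Example (hypothetical path and size): a 2 GiB payload
# make_random_file /tmp/payload.bin 2048
# ...then download /tmp/payload.bin from the other machine.
```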

Sorry … can’t reproduce the problem - seems that it is triggered by our setup.

We are also experiencing this error:

kernel: [ 4732.208714] eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0

The error occurs when transmitting large volumes of data from a Xavier dev kit to another computer, but the conditions to trigger the fault are more complex than just the volume of data. We use ROS and encounter the fault when sensors that are connected to the Xavier are processed on the Xavier and then sent to another machine via ROS using multiple topics so there are multiple active TCP connections. The fault also depends on the capabilities of the remote ethernet interface; it doesn’t seem to occur when using USB to ethernet adapters on the remote end.

By recording the ROS data we have been able to consistently reproduce the fault using two Xavier dev kits, one to play the bag file and the other to record it again. I can provide the ~600 MByte bag file and instructions on how to use it to reproduce the bug if desired.
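To confirm that a run actually hit the fault, the kernel log can be searched for the driver message quoted above. A minimal sketch, assuming the log text is piped in on stdin; the function name is made up for illustration:

```shell
# Count "TX ring full" reports from the eqos driver in log text read from stdin.
count_tx_ring_full() {
    # -F matches the string literally; grep exits non-zero when the count is 0,
    # so "|| true" keeps the function's exit status clean.
    grep -F -c 'eqos_start_xmit(): TX ring full' || true
}

# Example on the Xavier:
# dmesg | count_tx_ring_full
```

A count that stays at 0 after a long transfer suggests the fault was not triggered in that run.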

I just triggered the issue several times in a row on my test setup with intensive traffic of 4-5 MB files.

Every time, the solution was to power cycle my Xavier - something completely unacceptable for production environments.

I see that a ‘Forum Admin’ marked the previous reply as the accepted answer and I don’t know how to interpret that … @WayneWWW did you or someone else at nvidia succeed in reproducing the issue?

atanas,

Sorry, I didn’t verify this issue because there was not enough information and you told me it seems to happen only on your local setup.

Could you try to repro it on devkit + native l4t driver package from sdkmanager?

This is exactly how I work (if I got your request right of course).

I got the issue on Xavier Devkit upgraded to the latest firmware with sdkmanager.

In the beginning I assumed that the issue is triggered by transferring big files, but yesterday it happened during an intensive transfer of small files, i.e. it seems to happen when we get close to the max transmit bandwidth of the built-in network interface.
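One way to see how close a transfer gets to the link's transmit limit is to sample the interface's byte counter in sysfs. This is a rough sketch, not a measurement tool: the helper names are made up, and eth0 is assumed to be the built-in port as in the log message earlier in the thread.

```shell
# Read the cumulative TX byte counter of an interface (standard Linux sysfs path).
tx_bytes() {
    cat "/sys/class/net/$1/statistics/tx_bytes"
}

# Bits per second between two counter samples taken SECONDS apart.
# Usage: tx_rate_bps BYTES_BEFORE BYTES_AFTER SECONDS
tx_rate_bps() {
    echo "$(( ($2 - $1) * 8 / $3 ))"
}

# Example: print the eth0 TX rate once per second while the transfer runs
# prev=$(tx_bytes eth0)
# while sleep 1; do
#     cur=$(tx_bytes eth0)
#     echo "$(tx_rate_bps "$prev" "$cur" 1) bit/s"
#     prev=$cur
# done
```

If the printed rate drops to 0 while the sender is still busy, that would match the freeze described in this thread.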

Could you provide the script or tool you are using to transfer files, or share more detail about how “intensive” your transfer is?

Hope there is a quick way to reproduce this issue on my side. Thanks.

Sorry for the late reply. What I could do is send you our firmware app snap file together with some data to reproduce the issue. But I will probably only have time after the holidays, early next year.

@atanas @WayneWWW
I have come across the same problem. Have you solved it yet?
Here is my procedure to reproduce it.
On the Xavier side:
1. Set up the ROS environment.
2. Play a bag with test data that contains images etc.:

rosbag play image_cloud.bag -l

On the host side:

  1. Record all messages with:

rosbag record -a

Within no more than 1 minute the problem shows up.

BTW, here is the link to the test bag: https://drive.google.com/file/d/150zo_ImSSfIF2J3Rjt_pZM4sTu7Loz6c/view?usp=sharing

Hi askariz0503,

I am sorry that we haven’t resolved this issue, because we don’t know how to reproduce it.
It seems that some 3rd-party tools are needed to reproduce it. For example, you shared a rosbag tool with us, which is from ROS. Is it okay to use it on a pure jetpack setup?

@WayneWWW
I haven’t tested with a pure jetpack setup, but after I disable the TSO function with

sudo ethtool -K eth0 tso off

the problem disappears.
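To check whether the workaround is in effect, the TSO state can be read back from `ethtool -k`. A small sketch, assuming the usual `tcp-segmentation-offload: on|off` line in ethtool's feature listing; the helper name is made up:

```shell
# Print "on" or "off" for TSO, given `ethtool -k IFACE` output on stdin.
tso_state() {
    # Only the top-level summary line starts with "tcp-segmentation-offload";
    # the per-protocol sub-features are indented and named differently.
    sed -n 's/^tcp-segmentation-offload: *\([a-z]*\).*/\1/p'
}

# Example on the Xavier:
# ethtool -k eth0 | tso_state      # shows the current state
# sudo ethtool -K eth0 tso off     # the workaround from this post
```

Note that a setting made with `ethtool -K` is lost on reboot, so it has to be reapplied at boot, and that with TSO off the segmentation work is done by the CPU instead of the network hardware.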


Hi @askariz0503,

In our design we needed more ethernet ports, so we added a PCI extension board and didn’t use the built-in one - that way the problem didn’t occur any more. I didn’t have time to isolate a simple way to reproduce it back then, but since we open-sourced that project I could prepare something now.

If someone is interested, they could try to reproduce it by building and installing the snap package from this repo on a clean Xavier setup: