Moving the discussion from DRIVE AGX section to Xavier:
Is there someone able to help with that issue?
Hi atanas,
Could you share your setup and how to reproduce this issue?
The link to your previous thread does not tell us how to reproduce the issue; it only contains an error log.
Let me quote the important parts from the previous thread.
We got unpredictable freezing of the internal network adapter under heavy load, and the last messages in the kernel log are exactly the same as the ones the author of the other thread initially reported:
My setup and instructions to reproduce (not tested, but likely to work):
atanas,
Sorry for the late reply. Could you share some steps that directly reproduce this issue?
It would save time for both of us. If I send a massive amount of data but do not hit this issue, I would still have to ask you again, so it is better to clarify first.
I think you will hit the issue if you try to download a few gigabytes from the Xavier's built-in Ethernet port, whatever the protocol is: http, ftp, ssh ... I will try to confirm that right now.
Do you mean that if I put a video file, or something else such as a whole Jetson driver package tarball, on an FTP server and download it to the Tegra, I will see the Ethernet driver error?
Yes ... I'm trying to reproduce the issue by generating a random file from /dev/urandom, but I work remotely today and it will take a while ... give me 10 minutes and I will get back to you.
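A minimal sketch of how such a test file could be generated, assuming a POSIX shell with `head` available. The file path and size are hypothetical placeholders, not values from this thread; a real test would use several gigabytes.

```shell
#!/bin/sh
# Sketch only: the path and size below are illustrative assumptions.

# make_random_file PATH SIZE_MIB
# Writes SIZE_MIB mebibytes of incompressible data from /dev/urandom to PATH.
make_random_file() {
    out_path="$1"
    size_mib="$2"
    head -c "$((size_mib * 1024 * 1024))" /dev/urandom > "$out_path"
}

# Small size so the example runs quickly; raise it to a few thousand MiB
# to approximate the "few gigabytes" transfer described above.
make_random_file /tmp/xavier_eth_test.bin 8
ls -l /tmp/xavier_eth_test.bin
```

The resulting file can then be fetched from another machine over any of the protocols mentioned above (http, ftp, ssh). Random data is used so that compression along the way does not shrink the transfer.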
Sorry ... I can't reproduce the problem. It seems that it is triggered by our setup.
We are also experiencing this error:
kernel: [ 4732.208714] eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0
The error occurs when transmitting large volumes of data from a Xavier dev kit to another computer, but the conditions to trigger the fault are more complex than just the volume of data. We use ROS and encounter the fault when sensors that are connected to the Xavier are processed on the Xavier and then sent to another machine via ROS using multiple topics, so there are multiple active TCP connections. The fault also depends on the capabilities of the remote Ethernet interface; it doesn't seem to occur when using USB to Ethernet adapters on the remote end.
By recording the ROS data we have been able to consistently reproduce the fault using two Xavier dev kits, one to play the bag file and the other to record it again. I can provide the ~600 MByte bag file and instructions on how to use it to reproduce the bug if desired.
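To tell whether a given run actually hit the fault, the "TX ring full" message quoted above can be counted in the kernel log. The following is a small sketch; the helper name `count_tx_ring_full` is hypothetical, and the only string it relies on is the one from the log line in this post.

```shell
#!/bin/sh
# count_tx_ring_full: reads kernel log text on stdin and prints how many
# lines contain the eqos "TX ring full" message quoted in this thread.
count_tx_ring_full() {
    # -F: fixed-string match, -c: count matching lines.
    # grep exits non-zero when the count is 0, so mask that for `set -e` callers.
    grep -F -c 'eqos_start_xmit(): TX ring full' || true
}

# Typical use on the Xavier (reading the kernel log may require root):
#   dmesg | count_tx_ring_full
#
# Self-contained demonstration using the line reported above:
printf '%s\n' \
    'kernel: [ 4732.208714] eqos 2490000.ether_qos eth0: eqos_start_xmit(): TX ring full for queue 0' \
    | count_tx_ring_full
```

Running the helper before and after a bag replay gives a simple pass/fail signal: a count that grows during the test means the driver reported a full TX ring.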
I just triggered the issue several times in a row on my test setup with intensive traffic of 4-5 MB files.
Every time, the only solution was to power cycle my Xavier, which is completely unacceptable for production environments.
I see that a "Forum Admin" marked the previous reply as the accepted answer, and I don't know how to interpret that ... @WayneWWW did you or someone else at NVIDIA succeed in reproducing the issue?
atanas,
Sorry that I didn't verify this issue; there was not enough information, and you told me it seems to happen only on your local setup.
Could you try to reproduce it on a devkit with the native L4T driver package from sdkmanager?
This is exactly how I work (if I understood your request correctly, of course).
I got the issue on a Xavier devkit upgraded to the latest firmware with sdkmanager.
In the beginning I assumed that the issue was triggered by transferring big files, but yesterday it happened during an intensive transfer of small files, i.e. it seems to happen when we get close to the maximum transmit bandwidth of the built-in network interface.
Could you provide the script or tool you are using to transfer files, or share more detail about how "intensive" your transfer is?
I hope there is a quick way to reproduce this issue on my side. Thanks.
Sorry for the late reply. What I could do is send you our firmware app snap file together with some data to reproduce the issue. But I will probably only have time after the holidays, early next year.
@atanas @WayneWWW
I have come across the same problem. Have you solved it yet?
Here is my procedure to reproduce it.
On the Xavier side:
1. Set up the ROS environment.
2. Play a bag with test data that contains images etc.:
rosbag play image_cloud.bag -l
On the host side:
rosbag record -a
Within no more than 1 minute the problem shows up.
By the way, here is the link to the test bag: image_cloud.bag - Google Drive
Hi askariz0503,
I am sorry that we didn't resolve this issue, because we don't know how to reproduce it.
It seems this issue needs some third-party tools to reproduce. For example, you shared a rosbag tool with us, which is from ROS. Is it okay to use it on a pure JetPack setup?
@WayneWWW
I haven't tested with a pure JetPack setup, but after I disabled the TSO function with
sudo ethtool -K eth0 tso off
the problem disappeared.
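A hedged sketch for confirming that the workaround above took effect: `ethtool -k` (lowercase) lists the interface's offload features, and the helper below extracts the TCP segmentation offload state from that listing. The helper name `tso_state` and the sample input are illustrative assumptions; the sample only mimics the usual `ethtool -k` layout and is not output captured from a Xavier.

```shell
#!/bin/sh
# tso_state: reads `ethtool -k <iface>` output on stdin and prints the
# state of TCP segmentation offload ("on" or "off").
tso_state() {
    # The value may carry a suffix such as "[fixed]"; keep only the first word.
    awk -F': ' '$1 == "tcp-segmentation-offload" { split($2, v, " "); print v[1] }'
}

# Typical use on the Xavier (interface name eth0 as in the post above):
#   ethtool -k eth0 | tso_state       # state before the change
#   sudo ethtool -K eth0 tso off      # the workaround from the post above
#   ethtool -k eth0 | tso_state       # expected to print "off" afterwards
#
# Self-contained demonstration with illustrative input:
printf 'Features for eth0:\ntcp-segmentation-offload: on\n\ttx-tcp-segmentation: on\n' \
    | tso_state
```

Settings changed with `ethtool -K` apply to the running system only, so the command most likely has to be reapplied after a reboot unless it is added to the network configuration.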
Hi @askariz0503,
In our design we needed more Ethernet ports, so we added a PCI extension board and did not use the built-in port; that way the problem no longer occurred. I didn't have time to isolate a simple way to reproduce it back then, but since we have open-sourced that project I could prepare something now.
If someone is interested, they could try to reproduce it by building and installing the snap package from this repo on a clean Xavier setup: