Cold start ethernet problems

Hello,
I’m having some strange problems with our Xavier. The setup I’m running is a Xavier connected to 6 Ethernet cameras through a gigabit PoE switch. They each take two pictures per second at 1080x1200 resolution. I’m running JetPack 4.3, and have also tried reflashing the system.
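As a rough sanity check on the numbers above, the average raw image traffic can be estimated as follows. The pixel format is not stated in the post, so the bytes-per-pixel values (1 for 8-bit mono, 3 for 24-bit RGB) are assumptions for illustration only, and protocol overhead is ignored.

```python
# Rough estimate of the aggregate camera traffic.
# Assumption: bytes per pixel is not given in the post; 1 (8-bit mono)
# and 3 (24-bit RGB) are illustrative values only.

def stream_mbit_per_s(cameras, fps, width, height, bytes_per_pixel):
    """Return the raw image payload rate in Mbit/s, ignoring protocol overhead."""
    bytes_per_s = cameras * fps * width * height * bytes_per_pixel
    return bytes_per_s * 8 / 1e6

if __name__ == "__main__":
    for bpp in (1, 3):
        rate = stream_mbit_per_s(6, 2, 1080, 1200, bpp)
        print(f"{bpp} byte(s)/pixel: {rate:.1f} Mbit/s")
```

Under either assumption the average rate stays well below 1 Gbit/s, so average bandwidth alone would probably not explain the errors. Short bursts when several cameras send at once are not captured by this estimate.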

When the Xavier is started after being powered off / offline for a while (1 h+), the camera program receives images more and more slowly until it crashes, and ifconfig starts to show RX errors and overruns.
This only seems to occur when the system has been without power.
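To see when the RX errors start relative to boot, the kernel's per-interface counters can be logged at a fixed interval. This is a minimal sketch, not a diagnosis tool: the interface name eth0 and the 5-second interval are assumptions, and the counter files are the standard Linux ones under /sys/class/net/<iface>/statistics.

```python
import time
from pathlib import Path

# Standard Linux receive counters exposed per interface in sysfs.
COUNTERS = ("rx_packets", "rx_errors", "rx_dropped",
            "rx_over_errors", "rx_fifo_errors")

def read_counters(iface, root="/sys/class/net"):
    """Read the current value of each counter for one interface."""
    stats_dir = Path(root) / iface / "statistics"
    return {name: int((stats_dir / name).read_text()) for name in COUNTERS}

def diff_counters(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before[name] for name in before}

def monitor(iface="eth0", interval_s=5.0):
    """Print the per-interval counter growth until interrupted.

    Assumption: "eth0" is the interface the cameras are attached to.
    """
    previous = read_counters(iface)
    while True:
        time.sleep(interval_s)
        current = read_counters(iface)
        print(time.strftime("%H:%M:%S"), diff_counters(previous, current), flush=True)
        previous = current

if __name__ == "__main__":
    monitor()
```

Running this from boot on both a cold start and a warm reboot would give a timestamped record of when the error counters begin to climb.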

I first thought it might be a camera or switch issue, but the error only seems to occur with the Xavier; ordinary computers do not show this issue.

  1. Any error in dmesg?
  2. Any error if you just use < 6 cameras?
  3. Please upgrade to rel-32.4.3 (JP 4.4).
  4. What do you mean by “off power/offline”? Does the monitor go into power save?
  • I have since rebooted, but will check this again when I can reproduce the error.

  • I have also tried with 4 cameras, with the same problem. It just takes longer to occur (~7 min vs ~5 min).

  • Due to the changes in the CUDA library in 4.4 and how it integrates with the Docker image, I have not been able to migrate yet. Libtorch could not find the CUDA libraries.

  • If I shut down the system for more than an hour, the next immediate reboot might also show the issue, but after that the issue is nonexistent until I again wait an hour or more before powering on and booting the system.

You mentioned “docker”. Do you mean you don’t use a pure JetPack installation on the devkit directly?

I use pure JetPack as installed with SDK Manager; we just run the program in Docker based on the L4T image. I have also tested it outside of Docker earlier, but the error still occurs.

Hi,

Ok. I see.

“If I shut down the system for more than an hour, the next immediate reboot might also show the issue, but after that the issue is nonexistent until I again wait an hour or more before powering on and booting the system.”

This description is unclear. Could you tell us your definition of shutdown and reboot? I mean, if power off + reboot led to this issue, then you would always have it.

For example, I power off the device today and remove the power cable, then come back and power on the device again after 2 days. Does this case match your “power off/offline” definition? Will it hit the issue in such a case?

Is this application run right after each reboot?

Hi,
Your last example matches the description. By rebooting I mean calling the reboot command from the command line. Shutdown includes sending the shutdown signal and removing the power cable.

It will not always run correctly after the first reboot. But on the reboots after that, the problem does not occur.

The problem might also depend not on the number of reboots, but on the time elapsed since powering on.

“but time elapsed from powering on.”

How soon after boot-up do you hit this issue? Have you considered this as a variable?

We launch the program automatically as a service on boot. I estimate that it takes 4-7 minutes for the problem to start occurring.

Is there a specific application to reproduce this issue? We would like to reproduce it with our PoE switch and cameras.

We are using a program made with the Spinnaker SDK for controlling FLIR cameras.

That being said, I have now tested another Xavier we had available, and the issue does not seem to appear on that one.
Maybe there is something wrong with a component on the first one? Not sure if this matters, but the one showing the errors is powered by 12V and the one not showing them is powered using the supplied 19V adapter.

Are those two not NVIDIA devkits?

They are.

Then could you also try 19V power on the problematic device?

I will test that as well.

The problem still occurred on the problematic Xavier on 19V.

And any error in dmesg or syslog?

Not that I can see that obviously relates. I’ll attach the dmesg and journalctl dumps.

dmesg.txt (68.2 KB) journalctl.txt (187.1 KB)

Hi,

Back to the previous question again: how soon after booting up does your application run?
I just wonder whether launching the application later could improve this or not.

And I still suggest that you upgrade to rel-32.4.3. It is just for debugging purposes.

Hi,
The application is set to launch after the network becomes available, which makes it take less than 30 seconds to start at the moment, and fully launch in about 1 minute.

We want it to be able to start relatively fast, but increasing it to 2-3 minutes would not pose a problem. Although I’m not sure if the problem ceases to be an issue in this time frame.
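One way to try a later start without changing the program itself is a small wrapper that waits until the system has been up for a minimum time before launching it. This is only a sketch: the script name delayed_start.py, its argument layout, and the 180-second value in the usage line are hypothetical, and whether a delay helps at all is untested.

```python
import subprocess
import sys
import time

def read_uptime_s(path="/proc/uptime"):
    """Return seconds since boot; the first field of /proc/uptime on Linux."""
    with open(path) as f:
        return float(f.read().split()[0])

def remaining_delay_s(uptime_s, min_uptime_s):
    """Return how long to wait so that uptime reaches min_uptime_s."""
    return max(0.0, min_uptime_s - uptime_s)

def main(argv):
    # Hypothetical usage: delayed_start.py <min_uptime_seconds> <command> [args...]
    if len(argv) < 3:
        print("usage: delayed_start.py <min_uptime_seconds> <command> [args...]")
        return 2
    min_uptime_s = float(argv[1])
    command = argv[2:]
    time.sleep(remaining_delay_s(read_uptime_s(), min_uptime_s))
    # Run the real program and pass its exit code back to the service manager.
    return subprocess.call(command)

if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

For example, the service could run "python3 delayed_start.py 180 <camera program>" in place of the camera program, so that it starts no earlier than 3 minutes after boot.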

I will test upgrading to rel-32.4.3 later.