I can run multiple copies of deepstream_test_3.py (up to 7 pipelines, each fed 4 video files) without crashing the system (or causing it to self-reboot) if I disconnect the Ethernet network on the AGX Xavier (by clicking the network icon in the top right corner and selecting Disconnect right below Wired connection 1).
However, if I reconnect the network and run the same multiple copies of deepstream_test_3.py, the system crashes (self-reboots).
Steps to duplicate the crash (self reboot):
turn on AGX and make sure network is connected
set MAXN mode, set fan at 255
run 5 to 7 copies of deepstream_test_3.py feeding 4 videos each (the more copies you run, the easier it is to duplicate the problem)
go to a PC (running Ubuntu 18.04), connect to the AGX with “ssh agx.local”, run tegrastats in the background to log the status every second, then use “tail tegralog” to check the log frequently
around the time the 5th or 6th copy of deepstream_test_3.py is running, the system crashes (then self-reboots)
Background: it has been a long road to get here. Initially I suspected power supply voltage swings, so I added a 600W line conditioner to rule out a power issue. Then I suspected a thermal issue, but according to tegrastats the GPU temperature never exceeded 47C, and the CPU and other thermal readings were lower than that. Eventually linuxdev pointed out that one of my self-reboot logs showed the network was actually causing the self-reboot, which led to this post showing how to duplicate the issue. Attached please find the serial console log 7_run4_network_on_crash.log (233.4 KB) and the tegrastats log 7_run4_network_on_crash_tegrastats.log (80.6 KB) from when the system crashed. Be aware that the network error may not always show up in the console log. But so far, whenever the network is on, the system is not stable. I have tried two different routers and the result is the same: with the network on, the system crashes very easily when running multiple pipelines; with the network off, the system has been very solid so far.
Question: my product needs Ethernet turned on to transmit results in real time, but whenever the network is on the system is not stable (it keeps self-rebooting). How can we overcome this issue? WiFi is not a solution for our product. Please help. Thanks a lot in advance.
This morning I ran a few apps and everything seemed normal. But after I left the unit on with the network connected, a few hours later, even without anything running, the unit self-rebooted…
Hi alanz, I am curious about your duplication environment:
do you connect the AGX to a monitor (display) or run it headless? If it’s headless, how do you connect to the unit? ssh or VNC?
what’s the power mode? MAXN? or other?
did you run “sudo jetson_clocks” before your testing, or not?
did you run any apps during these 6 hours?
what’s your JP version “head -1 /etc/nv_tegra_release”?
what’s the GPU temperature during the running?
did you ever encounter “INFO: rcu_sched detected stalls on CPUs/tasks: 0” during 6 hours?
Thanks for this information. It will help pin down the differences between your system and ours.
Attached is another console log of the self-reboots from last night (they happened constantly): multiple_self_reboot.log (529.7 KB)
When the self-rebooting happens constantly, I notice a few things:
GPU/CPU/thermal temperature > 35C (even with no apps running) at 28C room temperature.
CPU 1 load > 98%, almost always at 100%; I don’t know what’s running, since the unit is not running any apps.
the unit goes into a mode where it reboots itself every few minutes, and I have to shut it down by pulling the plug and leave it off overnight (I cannot work on this unit anymore…)
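To get a hint about what is keeping CPU 1 busy, something like the sketch below could be run on the unit while the load is high. It takes two snapshots of /proc/<pid>/stat and ranks the tasks that last ran on a given core by the CPU ticks they used in between. The function names are my own, and this is only a rough approach: the "processor" field is just the core a task last ran on, and time spent directly in interrupt handlers is not charged to any task, so an empty or quiet result would itself point toward IRQs.

```python
import glob

def parse_stat(stat_text):
    """Parse one /proc/<pid>/stat line into (pid, comm, last_cpu, cpu_ticks)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid, _, comm = head.partition("(")
    fields = tail.split()
    # fields[0] is field 3 (state) in proc(5), so field N sits at index N - 3.
    utime = int(fields[14 - 3])     # user-mode ticks
    stime = int(fields[15 - 3])     # kernel-mode ticks
    last_cpu = int(fields[39 - 3])  # core the task last executed on
    return int(pid), comm, last_cpu, utime + stime

def snapshot():
    """Read every /proc/<pid>/stat into {pid: (comm, last_cpu, cpu_ticks)}."""
    snap = {}
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, last_cpu, ticks = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited while we were reading it
        snap[pid] = (comm, last_cpu, ticks)
    return snap

def busiest_on_cpu(before, after, cpu, top=5):
    """Rank tasks last seen on `cpu` by ticks used between two snapshots."""
    rows = []
    for pid, (comm, last_cpu, ticks) in after.items():
        if last_cpu != cpu:
            continue
        old_ticks = before[pid][2] if pid in before else 0
        rows.append((ticks - old_ticks, pid, comm))
    rows.sort(reverse=True)
    return rows[:top]
```

Usage would be roughly: a = snapshot(), wait about 10 seconds, b = snapshot(), then print busiest_on_cpu(a, b, cpu=1).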
This morning, when I turned on the unit, all CPU/GPU/thermal readings were < 32C, CPU 1 load was < 10%, and everything seemed stable and normal.
What does this imply? I have suspected for a long time that this unit is thermally sensitive, but I have never been able to “duplicate/nail it” in a solid way. When it happens (self-rebooting), it happens repeatedly… and I need to wait until the next day for it to “clear” up. Very strange behaviour. (Basically the unit is not usable anymore… ;( )
I think you are on to something with the temperature.
The default fan setting is quiet, which has a trip temperature of 46C. I changed the setting to cool, which has a trip temperature of 35C, with: sudo nvpmodel -d cool
Since then, the devkit has been playing YouTube HD full-screen videos non-stop with no issue.
Here is the latest tegrastats:
RAM 2440/31925MB (lfb 6939x4MB) SWAP 0/15963MB (cached 0MB) CPU [31%@2265,27%@2265,22%@2265,24%@2265,31%@2265,38%@2265,36%@2265,43%@2265] EMC_FREQ 0% GR3D_FREQ 28% AO@34C GPU@34.5C Tdiode@36.5C PMIC@100C AUX@34C CPU@36C thermal@34.95C Tboard@34C GPU 619/670 CPU 4183/3586 SOC 2788/2544 CV 154/154 VDDRQ 929/897 SYS5V 2564/2474
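For anyone who wants to compare logs, here is a minimal sketch of pulling the per-core CPU loads and the temperature sensors out of a tegrastats line like the one above. The regular expressions are based only on this sample line, so other JetPack releases may need adjustments. Skipping PMIC when looking for the hottest sensor is my own assumption, because it reads a flat 100C here.

```python
import re

# Sample line copied from the tegrastats output above.
SAMPLE = (
    "RAM 2440/31925MB (lfb 6939x4MB) SWAP 0/15963MB (cached 0MB) "
    "CPU [31%@2265,27%@2265,22%@2265,24%@2265,31%@2265,38%@2265,36%@2265,43%@2265] "
    "EMC_FREQ 0% GR3D_FREQ 28% AO@34C GPU@34.5C Tdiode@36.5C PMIC@100C AUX@34C "
    "CPU@36C thermal@34.95C Tboard@34C GPU 619/670 CPU 4183/3586 SOC 2788/2544 "
    "CV 154/154 VDDRQ 929/897 SYS5V 2564/2474"
)

CPU_RE = re.compile(r"CPU \[([^\]]*)\]")            # the bracketed per-core list
TEMP_RE = re.compile(r"(\w+)@(\d+(?:\.\d+)?)C\b")   # e.g. GPU@34.5C

def parse_tegrastats(line):
    """Return per-core loads (percent) and sensor temperatures (Celsius)."""
    cpu_loads = []
    match = CPU_RE.search(line)
    if match:
        for core in match.group(1).split(","):
            load = core.split("%")[0]
            # A core that is powered down is reported as "off".
            cpu_loads.append(int(load) if load.isdigit() else None)
    temps = {name: float(value) for name, value in TEMP_RE.findall(line)}
    return {"cpu_loads": cpu_loads, "temps": temps}

def hottest(temps, ignore=("PMIC",)):
    """Return (sensor, temperature) for the hottest sensor not in `ignore`."""
    candidates = [(name, t) for name, t in temps.items() if name not in ignore]
    return max(candidates, key=lambda pair: pair[1])
```

Running parse_tegrastats over every line of a tegrastats log and flagging lines where one core stays near 100% would make the "CPU 1 creeping up" pattern easier to spot across reboots.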
For your case, could you give us a summary of how many issues you’ve filed?
It looks like all of them are connected rather than separate issues…
For example, I saw you have the topic below too, plus the previous “power supply” issue. You’ve filed 3 topics and they all look the same to me.
As I pointed out in the power supply topic, you always see a kernel panic before the system reboots, and that kernel panic comes from the Ethernet driver. That is also connected to this topic.
Thus, please stop filing new topics. We can use this one to track.
It links to all the posts I have filed on this issue. It seems all the issues I have filed so far come down to one symptom (not the root cause): the CPU 1 load inches up over time to 100%, or close to it, and then the system crashes.
My guess is that some part of the system keeps firing IRQs and inundates the CPU (that would explain why the load gets higher and higher over time). The suspected parts (could be s/w or h/w) are:
power management: bpmp, etc.
network: eqos, etc.
gpu : nvgpu, etc.
others,
eventually causing a CPU stall, then kernel panic - not syncing: softlockup
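One way to check the IRQ guess would be to diff two snapshots of /proc/interrupts and see which interrupt source is growing fastest on CPU 1. Below is a minimal sketch of that idea. The function names are my own, and the sample numbers and device names in it are made up purely for illustration; they are not from my logs.

```python
def read_interrupts(path="/proc/interrupts"):
    """Return the raw text of /proc/interrupts (Linux only)."""
    with open(path) as f:
        return f.read()

def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (per_cpu_counts, description)}."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        tokens = rest.split()
        counts = []
        for tok in tokens[:ncpu]:
            if not tok.isdigit():
                break                      # short rows such as "Err:" have fewer columns
            counts.append(int(tok))
        table[label.strip()] = (counts, " ".join(tokens[len(counts):]))
    return table

def busiest_irqs(before, after, cpu, top=3):
    """Rank interrupt sources by how much their count grew on one CPU."""
    deltas = []
    for label, (counts, desc) in after.items():
        if cpu >= len(counts):
            continue
        old = before.get(label, ([], ""))[0]
        prev = old[cpu] if cpu < len(old) else 0
        deltas.append((counts[cpu] - prev, label, desc))
    deltas.sort(reverse=True)
    return deltas[:top]

# Hypothetical snapshots (made-up values and device names).
BEFORE = """\
           CPU0       CPU1
 40:       1000       2000     GICv2  226 Level     eth0
 41:        500        500     GICv2  102 Level     gpu
Err:          0
"""
AFTER = """\
           CPU0       CPU1
 40:       1100       9000     GICv2  226 Level     eth0
 41:        600        520     GICv2  102 Level     gpu
Err:          0
"""
```

On the unit itself, the two snapshots would come from read_interrupts() taken a few seconds apart; if one source dominates the growth on CPU 1 while its load climbs, that would narrow down which of the parts listed above to look at.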
Well, that’s my two cents, but I have no clue what is causing these symptoms. My setup is extremely simple (display + keyboard + mouse + Ethernet), with no other sensors. The unit uses the 65W power supply that came with the product, plugged into a 600W line conditioner used exclusively for the AGX Xavier (no other device plugged in). Power mode is set to MAXN, with “sudo nvpmodel -d cool” to keep the fan running. The system can still self-reboot without any apps running. Yesterday, for example, I turned it on around 9:00am; it self-rebooted around 12:15pm, a 2nd time around 12:45pm (still nothing running), a 3rd time around 1:15pm (still no apps running), and a 4th time around 4:30pm. All the console logs and tegrastats logs can be found in this post
I found a way to crash the Jetson AGX Xavier DevKit:
1 - run all executables in : /usr/src/nvidia/graphics_demos/prebuilts/bin/x11
2 - spread them nicely to occupy the whole screen
3 - wait 30 minutes
I was running a couple of utilities from a remote station at the same time, so here is the last log output: