Occasional Crash - strange output pattern

With reference to this closed thread: Nano occasional crash - strange output pattern

This happened to me again today on another Nano device, on a different display.

I have seen the Nano crash with this output pattern on many occasions, on many devices. Does the pattern give a clue as to why the device crashed? It happens once every few weeks, with any of my Nano-based devices. I have no idea of the cause; it occurs at seemingly random intervals. Hope you can help!

I should add that overall the image looks grey. This is a close-up of the pixels on the screen, and it represents around 10% of the width of a 1080p output.

Another important piece of information: when this happens, I can connect through SSH, close our application, and restart it, but the grey pattern persists, as if it overlays everything on the desktop.

FYI I have seen this happen on both:
Nvidia Nano developer kit booting from microSD
Nvidia Xavier NX plus Auvidea JNX30 Carrierboard booting from internal eMMC with additional M.2 SSD

I discovered that:

sudo pkill X

does get rid of the grey pattern and returns me to the login page on the device.

This doesn’t really solve my problem, but I guess it gives us a hint about where the issue is coming from?

What we want to know is how we can reproduce this issue with our devkit.

Since you said this issue can happen on both the Nano and an NX custom board, I don’t think it is a random bug.

What kind of application are you running? Also, which JetPack release are you running now?

I wish I knew how to reproduce it with the devkit! I would have a much better chance of solving it myself if I could.

I am running our OpenGL-based application, which normally runs just fine, but occasionally this happens.

Sometimes it happens soon after starting the device (and our application); other times it will run for days and days without this happening at all.

It seems like a very specific pattern, something that I am sure our application is not generating.

I can still access the device via SSH when this occurs, but I cannot use the device as the screen is filled with this pattern.

A power cycle gets everything running again just fine, but I would really like to know what causes this pattern.

Re Version:

nvidia@hive-9994:~$ cat /etc/nv_tegra_release

R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t210ref, EABI: aarch64, DATE: Fri Jun 26 04:38:25 UTC 2020

It is not really an option to update to the latest JetPack, as we have quite a lot of these devices in use already.

With this information alone, we cannot tell what is going on.

If this can be worked around by killing Xorg, then I think it is still a software-side problem, just a bug with a low reproduction rate.

You can try more tests, such as running this on other Jetson Nano devices to see whether the issue still happens. That can prove it is not due to broken hardware.

Also, try upgrading the system. This is just for debugging; I am not asking you to move your product to the latest release entirely.

It is not a bug in our application, because I can kill our application via SSH and then restart it, and the pattern persists, overlaying anything that happens on the device.

I am not saying that this is a bug in your application. My point is that you can first check whether this issue is triggered by your application. Triggering a bug does not mean your application itself has a bug.

This is a totally new issue that no one has reported before, so it will not be resolved just by exchanging a few comments.

Please clarify the situation first:

  1. What is the exact situation that triggers this problem? I mean, if you don’t run your application, does this error still happen on its own?

  2. Do you have other Jetson Nano devices to validate this issue? This can show whether it is a hardware issue or not.

  3. What is the reproduction rate of this issue? I mean, if you run your application 1000 times, how many times will it hit this?

  4. Will this issue happen if you use rel-32.6.1? Again, this is just for debugging. If you cannot run a GL application on rel-32.6, please tell me the reason.

  5. Is it possible to share your code with us so that we can try to reproduce this with our device?

You might want to ssh in and monitor the Xorg log as this occurs. I’ll assume you are using a DISPLAY of “:0”, but adjust if using something like “:1”. On a separate ssh terminal run this:
sudo tail -f /var/log/Xorg.0.log

Note what the output stops at once your application is running normally. Then, after the problem hits, but before killing it, note if something has been appended to the log.
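One way to make that before/after comparison less dependent on watching a scrolling terminal is to record how long the log is while things are healthy, then print only what was added afterwards. The following is a minimal sketch, not something from this thread: the function names log_mark and log_since are made up for illustration, and it assumes the log is at /var/log/Xorg.0.log (display “:0”) and is readable by the user running it.

```shell
#!/bin/sh
# Hypothetical helpers for comparing a log before and after a failure.
# Only the log path /var/log/Xorg.0.log comes from the thread.

# log_mark FILE: print the current number of lines in FILE.
log_mark() {
    wc -l < "$1" | tr -d ' '
}

# log_since FILE MARK: print every line of FILE after line number MARK,
# i.e. whatever was appended since log_mark returned MARK.
log_since() {
    tail -n "+$(( $2 + 1 ))" "$1"
}
```

For example, run mark=$(log_mark /var/log/Xorg.0.log) once the application is up, and after the grey pattern appears (but before killing X) run log_since /var/log/Xorg.0.log "$mark" to see only the new lines. Capturing before X is killed matters because a restarted X server normally starts a fresh log.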

linuxdev - thanks for the suggestion. Much appreciated - I will try that next time it happens

WayneWWW - thanks for giving this some further thought, here are some answers to your questions…

  1. What is the exact situation that triggers this problem? I mean, if you don’t run your application, does this error still happen on its own?

The error happens a few minutes after the devices are started. It is rare: sometimes I will go for weeks without seeing it, then sometimes it will happen a few times in a row after the device boots. Our application is run from an sh script set up as a systemd service.

  2. Do you have other Jetson Nano devices to validate this issue? This can show whether it is a hardware issue or not.

I have a few of them yes, I have seen it happen on many different devices.

  3. What is the reproduction rate of this issue? I mean, if you run your application 1000 times, how many times will it hit this?

I would estimate in a thousand runs of the device it would happen around 40 times.

  4. Will this issue happen if you use rel-32.6.1? Again, this is just for debugging. If you cannot run a GL application on rel-32.6, please tell me the reason.

I will try to set up a new Jetson release for comparative testing.

  5. Is it possible to share your code with us so that we can try to reproduce this with our device?

I will look into this too.

Thanks and best wishes

I have a few of them yes, I have seen it happen on many different devices.

With this answer from you, I guess we need your sample code to reproduce this issue and see what is going on. This is probably just a software-side problem.

Also, you can try disabling this systemd service first to confirm whether this issue is triggered by your application.
Frankly speaking, I guess it is. We have lots of users here who conduct reboot stress tests. They have not reported any such issue after thousands of reboot iterations.

Just a thought: This might be a timing issue related to when the service starts. Perhaps there is some other service which your program depends on, and normally that other service is up and running in time. Such services might change timing on occasion during boot. If this is the case, then you might check whether your program needs another service which the systemd unit should mark as a prerequisite, to guarantee the order in which the services start. I base this on the idea that this is “sometimes” after a boot, versus “sometimes after the system has been running for some time”. That sounds like timing.
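If the cause does turn out to be start-up ordering, one simple experiment is to have the sh script wait until the X server answers before launching the application. The sketch below is only an illustration under assumptions: wait_until is a hypothetical helper name, and the choice of xset q as the readiness check, display “:0”, the retry counts, and the application path are guesses rather than anything confirmed in this thread.

```shell
#!/bin/sh
# wait_until TRIES DELAY CMD...
# Run CMD until it succeeds, at most TRIES times, sleeping DELAY seconds
# between attempts. Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$(( tries - 1 ))
        # Only pause if another attempt is still to come.
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example use in a start script (assumed display and placeholder path):
#   wait_until 30 1 xset -display :0 q && exec /path/to/application
```

If the pattern stops appearing with such a wait in place, that would support the timing theory. The systemd-level equivalent would be an ordering directive such as After= in the unit file, but which unit to order against depends on the setup.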

Hi,

Almost one week has passed. Are you able to share the source code of your program?

This issue won’t make any progress unless you share a sample for us to check.

Hi Wayne. Thanks for checking in. Much appreciated.

I am not going to be able to provide the source for our application. But I will continue to monitor the systems to try to find the cause of the problem. If I can find the source of the problem then I will try to provide a cut down application to reproduce the error, if possible.

But I haven’t seen it happen since we last spoke, so it’s pretty tricky to trace. I will try the suggestion from linuxdev about the Xorg log next time it happens and report back.

I also posted here in case that attracts any interest. We are using LXDE as our desktop, so maybe it is something connected with that.

https://forum.lxde.org/viewtopic.php?f=8&t=40879

If this is a 4GB Nano, you can also try with the default gdm3. But since it looks like this issue takes a really long time to reproduce even once, I will leave it to you to decide.

Thanks for the suggestion. Any tips on how to get gdm3 going? Or is that just the name of the default desktop that ships with the Jetson platform?

If it is the default one, we already tried that and had issues with it for other reasons. We like the lean nature of LXDE.

More precisely, the default display manager is gdm3 and the desktop is GNOME.
