Strange output pattern - occasional crash - continued

With reference to this closed thread: Nano occasional crash - strange output pattern

This happened to me again today on a new Nano 4GB dev kit, on a different display.

I have seen the Nano crash with this output pattern on many occasions, on many devices. Does the pattern give a clue as to why the device crashed? It happens once every few weeks with any of my Nano-based devices. I have no idea of the cause; it occurs at seemingly random intervals. Hope you can help!

I should add that overall the image looks grey; this is a close-up of the pixels on the screen. This image represents around 10% of the width of a 1080p output.

Another important piece of information: when this happens, if I connect through SSH I can close our application and restart it, but the grey pattern persists, as if it overlays everything on the desktop.

FYI I have seen this happen on both:
Nvidia Nano developer kit booting from microSD
Nvidia Xavier NX plus Auvidea JNX30 Carrierboard booting from internal eMMC with additional M.2 SSD

I discovered that :

sudo pkill X

does get rid of the grey pattern and returns me to the login page on the device.

This doesn’t really solve my problem, but I guess it gives us a hint about where the issue is coming from?

I also discovered this time round that our application is definitely still running just fine underneath this output. I am able to stream the display buffer from our application over the network to another computer and our application is running fine, but this pattern is displayed on the output.

I have attached the Xorg log as suggested by linuxdev in the previous thread. The fault occurred around 5 minutes after boot; this full-screen grey output pattern was displayed for around 10 minutes, and then it went away and the display returned to our application, which was running fine.

I have left the device running and I can SSH into it.

The fault occurred again at 6 AM, around 45 minutes after boot.

Clutching at straws I tried opening htop, looked at what else was running on the device and tried killing a few things. I did the following:

$ sudo killall clipit
$ sudo killall haveged
$ sudo killall openbox

I then touched the mouse and the pattern went away… So, embarrassingly, it seems like it might be a screen saver of some sort. I thought the screen saver was disabled in our setup scripts; maybe something in the setup scripts is failing. We do not usually have a mouse or keyboard connected to the device; only by chance did we have one connected this time. Normally our only access to the device is through SSH.

I am still confused why this only happens intermittently, and not all the time. But at least I am getting a little closer to establishing what the cause of the problem might be.

Hi,

You have already filed the same topic 3 times.
I have to be honest: this issue won’t have any solution if you keep filing the same issue without providing useful information.

We need your sample code to reproduce this issue because no other users have ever told us they saw such a pattern on their board. Thus, your application may be the key to reproducing this issue.

If you want some analysis from me, I can only say the following:

  1. The fact that killing X removes the pattern only tells us that this weird pattern is rendered through Xorg. The display manager is based on X. Lots of our sample code is based on X too. We can never know which one is causing the problem. What is funny is that we still don’t know whether your application is based on X, even after you filed 3 duplicate topics. Could you tell us?

  2. You said this issue happened on Nano and NX. Then I can say this issue may not be related to the kernel.

  3. In your previous topics, you mentioned you are using LXDE, and you said you would try gdm3. I think this is the right direction to debug, since not many users are trying LXDE, so maybe this issue really comes from LXDE.
    However, where is the gdm3 result now? You just came back and filed almost the same content as in the previous topics. I really don’t know why you keep going around this loop.

I don’t know how large your sample code is. Most users who say they cannot share their sample code can still give out a simplified version. For example, if your sample depends on some camera driver which you cannot share, then feed a dummy buffer to the sink and see if you can reproduce the issue; if you can, then share that simplified version.

FYI, that background probably means the X server is running without a window manager. Raw X without a desktop or login manager tends to look like this.

Hi Wayne

I’m sorry if my reposts have annoyed you. I wanted to continue to discuss the issue (as it is still occurring for us) but the original posts get locked after 2 weeks. The nature of this issue is that it is very intermittent, so sometimes I don’t see it for a few weeks, and with holiday season as well I have not been using the system.

I understand your frustration. I would love to simply share the source code with you and be able to describe the method to reproduce the issue to you, but unfortunately that is not how this is going to work. In order for me to create a cut-down version of our application which reproduces the issue, I would first need to know how to reproduce the issue. At this time the issue happens at random intervals with no association to anything that is happening in our application. Also, the issue occurs outside of our application, so I am sure our source code would not be of any use.

Also, your suggestion to use gdm3, although much appreciated, does not work for us. The reason we switched to LXDE was that the gdm3 desktop creates numerous notifications that appear above full screen top level windows. This is unacceptable for our application. With LXDE I was able to avoid this problem. I would actually prefer to use gdm3 as it is a much better UI in my opinion, but having spent a few days trying to work out how to prevent all possible notifications appearing above full screen top level windows, I reluctantly ended up changing to LXDE.

As I wrote in my last post, I am getting closer to the source of the problem. It seems to be related to some sort of screensaver. I am currently running soak tests on a system where I have run:

sudo apt -y purge xscreensaver

after installing LXDE in the device setup script. Hopefully this will resolve the issue, but only time will tell, as the issue is extremely intermittent.

I do not know for sure that xscreensaver is causing the issue; I just know that when I moved the mouse the issue went away (we don’t normally have a mouse connected to the device in this application). There could be some other screensaver-type application running. I still do not understand why the issue is so intermittent: it can run for days and days without exhibiting the problem, but then out of the blue this occurs. If it is related to xscreensaver, which was previously disabled in the device setup scripts, I do not understand what is occasionally enabling it.

I would suggest that you save “/var/log/Xorg.0.log” when things are working as you expect. Then, while it is failing (it is important not to reboot), get a copy of the new “/var/log/Xorg.0.log”. You might also want to look at the tail of “dmesg” and see if the event logged something related to the failure of the window manager. It is extremely difficult to debug something like this, which is basically “it crashed” with no detail from a log.
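The good/bad log comparison described above could be scripted roughly as follows. This is only a sketch: the function name `snapshot_logs` and the `~/x-snapshots` directory are made up for illustration, and only the “/var/log/Xorg.0.log” path comes from the post.

```shell
#!/bin/sh
# Hypothetical helper: copy the X log and the tail of dmesg into a labelled directory.
# Usage: snapshot_logs good   (while the display looks normal)
#        snapshot_logs bad    (while the pattern is on screen, before any reboot)
XORG_LOG="${XORG_LOG:-/var/log/Xorg.0.log}"
SNAP_ROOT="${SNAP_ROOT:-$HOME/x-snapshots}"   # assumed location, not from the thread

snapshot_logs() {
    label="$1"
    dest="$SNAP_ROOT/$label-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    # The X log may not exist on every setup; record that instead of failing.
    if [ -r "$XORG_LOG" ]; then
        cp "$XORG_LOG" "$dest/Xorg.0.log"
    else
        echo "no readable $XORG_LOG" > "$dest/Xorg.0.log.missing"
    fi
    # dmesg may need root on some systems; keep whatever output is available.
    dmesg 2>/dev/null | tail -n 200 > "$dest/dmesg.tail"
    # Print the snapshot directory so the caller can diff two snapshots later.
    echo "$dest"
}
```

After taking one snapshot of each kind, something like `diff -u <good-dir>/Xorg.0.log <bad-dir>/Xorg.0.log` would show what the X server logged between the two states.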

I can tell you something which might be related. Long ago very few Linux systems had OpenGL support unless the owner added this manually. One could pick from a number of screensavers. Some of those were OpenGL, and if one of those popped up on a system without OpenGL support, then the window manager would die just as you mention. If there is some sort of correspondence with either low power mode or with a screen saver, then you could purposely set the screen saver time to occur as often as possible (e.g., perhaps once per minute) and get more test cases. You could even change which screensaver is used to get more tests against that one.
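To get more test cases per day, the blanking and DPMS timeouts could be shortened with xset. The sketch below assumes the stock X screen blanker and DPMS are the triggers (not confirmed at this point in the thread); `shorten_blanking` is a made-up name, and the 60 second value is just the “once per minute” example from above.

```shell
# Hypothetical helper: make the suspected trigger fire often.
#   xset s <timeout> <cycle>             built-in X screen saver timing, in seconds
#   xset dpms <standby> <suspend> <off>  DPMS timeouts, in seconds
shorten_blanking() {
    secs="${1:-60}"
    xset s "$secs" "$secs" &&
    xset dpms "$secs" "$secs" "$secs" &&
    xset +dpms
}

# Only touch the X server when one is reachable from this shell.
if [ -n "${DISPLAY:-}" ] && command -v xset >/dev/null 2>&1; then
    shorten_blanking 60
fi
```

Over SSH the shell normally has no DISPLAY set, so it would have to be exported first (for example `export DISPLAY=:0`, which is an assumption about the local display number). `xset dpms force off` is another way to put the monitor into DPMS off immediately rather than waiting for a timeout.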


If the device you’re connecting to the Nano from has X11 graphical support available, you could try (within trusted, secure networks, depending on OS) ssh -X (or ssh -Y), which transfers the graphical output of remote applications to that host’s graphical system (limitations may occur because of network and host system performance constraints), for ‘debugging’ that graphical output pattern.


One downside of using ssh forwarding (the “-X” or “-Y” options) is that all software and hardware for GPU must be on the host PC. As an example, if you are using a hardware accelerated OpenGL application (including a screensaver) which is not supported on the Jetson itself, then forwarding to a PC which does have the required software/hardware would make the application suddenly work. Or the reverse, if the software works on the GUI of the local Jetson, but the PC does not have supporting hardware or software, then forwarding will crash on the PC. It makes for some interesting GPU performance testing if you are using forwarding and don’t recognize that instead of using the Nano GPU you are using something like one of the Titan series :P

I don’t know if it is possible to forward a screensaver, but it is an interesting thought for testing.

Thanks for the feedback guys. Much appreciated.

FYI linuxdev, I did take a look at the Xorg.0.log file you had suggested in the previous thread, and there was nothing of interest around the time the issue was occurring.

Hopefully the issue was just xscreensaver trying to start and failing in some way due to a conflict with our OpenGL application. Time will tell.

Thanks for your help anyway!

How about setting up a script that keeps checking whether the window manager (lxdm? openbox?) is running on the Nano? If it is not, or if xscreensaver is starting, the script copies dmesg, ps, and journalctl output (for some seconds/minutes) to error log files.
Even if it runs all the time until a possible X11 error occurs, this should have a low impact on system resources.
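A rough sketch of such a watchdog is below. It assumes the window manager process is called openbox (the thread mentions it running under LXDE) and writes to a made-up log path; the function names are illustrative only.

```shell
#!/bin/sh
# Hypothetical watchdog: when the window manager is missing or xscreensaver
# appears, append ps, dmesg, and journalctl output to a log file.
WM_NAME="${WM_NAME:-openbox}"                  # assumption: LXDE session using openbox
LOG_FILE="${LOG_FILE:-$HOME/x-watchdog.log}"   # assumed location

# True when a process with exactly this name exists.
is_running() { pgrep -x "$1" >/dev/null 2>&1; }

capture_state() {
    reason="$1"
    {
        echo "=== $(date '+%F %T') $reason ==="
        # Each capture is best effort; a missing tool must not stop the watchdog.
        ps -eo pid,etime,comm,args 2>/dev/null || true
        dmesg 2>/dev/null | tail -n 50
        journalctl -n 100 --no-pager 2>/dev/null || true
    } >> "$LOG_FILE"
}

check_once() {
    if ! is_running "$WM_NAME"; then
        capture_state "$WM_NAME not running"
    elif is_running xscreensaver; then
        capture_state "xscreensaver running"
    fi
}

# Loop only when asked to, so the functions can also be sourced and tried by hand.
if [ "${1:-}" = "--loop" ]; then
    while :; do
        check_once
        sleep "${INTERVAL:-30}"
    done
fi
```

Saved as, say, `x-watchdog.sh` (a hypothetical name), it could be started with `sh x-watchdog.sh --loop &` from the session startup. One process check every 30 seconds should cost very little, though the log will grow for as long as the fault condition lasts.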

OK I think I found the solution.

I added this:

xset s off && xset -dpms

to the end of ~/.profile

It seems like the issue was to do with DPMS.

I had a device exhibit the issue again, and via SSH I entered the xset -dpms command and the issue went away immediately. So hopefully DPMS was the cause of the problem!
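Since ~/.profile may also be read by logins that have no X display (an SSH login, for example), a guarded variant of the line above might be a little safer. This is a sketch, not what was actually deployed; `disable_blanking` is a made-up name.

```shell
# Possible ending for ~/.profile: call xset only when an X display is reachable,
# so logins without DISPLAY (such as SSH sessions) skip it quietly.
disable_blanking() {
    # Same two commands as in the post: screen saver off, then DPMS off.
    xset s off && xset -dpms
}

if [ -n "${DISPLAY:-}" ] && command -v xset >/dev/null 2>&1; then
    disable_blanking
fi
```

Whether ~/.profile is sourced at all when the graphical session starts depends on the display manager, so it is worth confirming with `xset q` inside the session that the settings actually took effect.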

Can we conclude that this pattern may show up when the display is going to sleep or waking up with LXDE?

If this is true, is it still hard to reproduce, or is it easier now?

Hi Wayne

I am pretty confident this problem is now resolved.

If I run the command

xset +dpms

which enables DPMS, then leave the device for a while, eventually the problem occurs.

If I run the command

xset -dpms

which disables DPMS, then leave the device for a few hours, the problem does not show up.

So I’m pretty sure this is resolved now.
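For anyone repeating this on/off comparison, it helps to record which DPMS state the X server was really in during each run. The sketch below parses the output of `xset q`, which reports a line such as "DPMS is Enabled" or "DPMS is Disabled"; the helper name `dpms_state` is made up.

```shell
# Hypothetical helper: read `xset q` output on stdin and print the DPMS state
# as "enabled", "disabled", or "unknown" when no DPMS line is present.
dpms_state() {
    awk '/DPMS is/ { print tolower($3); found = 1 }
         END       { if (!found) print "unknown" }'
}

# Example on a live session (DISPLAY=:0 is an assumption for use over SSH):
#   DISPLAY=:0 xset q | dpms_state
```

Logging this value with a timestamp at the start and end of each soak test would make it clear that a quiet run really had DPMS disabled the whole time.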

Thanks very much for your help.


It is interesting since low power mode and related software (such as USB suspend mode) have a high rate of being an issue throughout all of Linux. I am not surprised that a low power mode is issued upon screensaver starting, but waking up some part (in this case the window manager) sometimes fails. The part which is really curious though is whether or not (A) the window manager itself failed to wake up, or if (B) the window manager never received the wake up message. Either way you’d end up with a raw X display and no window manager.