Boot fails with looped message: "systemd-journald[2405]: /dev/kmsg buffer overrun, some messages lost"

I have a custom board that includes four Xavier NX modules.
When this board is connected to one of our specific host systems, Linux completes its boot process. (Linux is working as desired).
When the same board is connected to another host system, the boot procedure seems to get stuck in an endless loop of:
“systemd-journald[2405]: /dev/kmsg buffer overrun, some messages lost.”

Attached is a log printout
stand_alone.txt (36.0 KB)

Can anyone look at the attached file and guide me on what I should do to debug this issue?

Thanks,
Eran.

Hi eran.peled,

Which JetPack version are you using?

Could you also share the log from a successful boot?

Why does your board boot differently when it is connected to a different host?
Are you booting it from the network?

How do you flash the 4 Xavier NX modules on your custom board at the same time?

Hi Kevin,

Thanks for the quick reply.

JetPack 4.4.

That is a very important piece of data, which I am also trying to obtain, since the other “working” system is not currently available.
I do have a third host system that includes our card (it is a lab jig), on which Linux also works just fine. Its log is attached.

working_jig.txt (42.8 KB)

The boot is the same in both cases: from the internal eMMC, with no network boot.

I have a DIP switch which routes the USB to the specific module.

That’s quite an old release. Could you update to the latest R32.7.4 and verify?

It seems there are also many error messages, including display, I2C, and USB errors, among others.

Could you share the block diagram and the connections of your setup?

I still wonder why the boot behavior is affected by the host PC.
Can it boot if you don’t connect the board to the host PC?

Currently, R32.7.4 with a BSP that fits our hardware is not available from the hardware vendor.

I am very sorry, but I am not allowed to send a full diagram of the hardware. I would appreciate it if you could give me specific questions that I could ask the hardware team…

I was probably misunderstood regarding the host. By host, I don’t mean a host PC; I mean another piece of embedded hardware that our board is connected to. This host supplies its power, connects to its buses (PCIe, USB, Ethernet…), and routes its output (DP)…

It seems you are using the custom carrier board from another vendor.

It seems you are using another embedded module as host and using Jetson as client.

The hardware design is very different from the devkit.

Could you reproduce a similar issue on the devkit?
Otherwise, I would suggest asking your vendor for help, as they probably know the custom design of your board much better.
In addition, as I mentioned before, there are many error messages that should be fixed.

This message occurs when the kernel log buffer is full while new messages are still being generated.
You could increase the buffer size by configuring CONFIG_LOG_BUF_SHIFT in the kernel config.
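
As a rough illustration of what that setting means: CONFIG_LOG_BUF_SHIFT is a power-of-two exponent, so the buffer holds 2^shift bytes. The Python sketch below uses helper names of my own choosing, purely for illustration. It prints the size for a few shift values and checks whether a `log_buf_len=` argument is present on the kernel command line, which is the boot-time way to request a larger buffer without rebuilding the kernel. The suffix handling is simplified and only covers plain decimal values with an optional K, M, or G.

```python
import re

def log_buf_bytes(shift):
    """Kernel log buffer size in bytes for a given CONFIG_LOG_BUF_SHIFT."""
    return 1 << shift

def parse_log_buf_len(cmdline):
    """Return the log_buf_len= value in bytes, or None if it is not set.

    Simplified: decimal digits with an optional K/M/G suffix only.
    """
    m = re.search(r"(?:^|\s)log_buf_len=(\d+)([KMGkmg]?)(?=\s|$)", cmdline)
    if not m:
        return None
    scale = {"": 1, "k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
    return int(m.group(1)) * scale[m.group(2).lower()]

if __name__ == "__main__":
    # The shift values here are arbitrary examples, not recommendations.
    for shift in (17, 18, 20):
        print(f"CONFIG_LOG_BUF_SHIFT={shift} -> {log_buf_bytes(shift) // 1024} KiB")
    try:
        with open("/proc/cmdline") as f:
            print("log_buf_len on this boot:", parse_log_buf_len(f.read()))
    except OSError:
        print("/proc/cmdline is not readable on this system")
```

A larger buffer only keeps more of the early messages; it does not address whatever is generating them.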

I am attaching a drawing, which will probably explain the situation better (and why the devkit is not relevant here and can’t play a real role in the debug process).

The carrier board’s vendor is currently not involved, as its product has no problem.

Its carrier board works in its JIG and in another configuration of ours.

What I am trying to find out is what could prevent Linux from booting properly in our non-working configuration (or at least to find someone who could guide me toward a proper way to debug this).

Many error messages are also seen in the working configuration.

I am trying to find out what the “deal breaker” is for L4T, which prevents it from finishing its boot. (Of course, this is only our first step. After this problem is resolved, we will work on fixing all the other error messages.)

What’s the difference between your “Product Box 1” and “Product Box 2”?
Is there any hardware design difference, or is a different method used to flash the board?

Hi Kevin,
First of all, I would like to thank you for keeping up with this thread :)

There are some differences in the actual “Other Cards” connected to the boards, and thus in the overall PCIe and DP connectivity. (There is a PCIe switch connecting the NXs and some of the “Other Cards”.)
Flashing and booting the boards is always the same. Boot is performed from the internal eMMC of each of the NXs, and flashing is performed using a DIP switch selector on the vendor’s carrier card (to choose the correct NX) together with the recovery signal and USB.

This may be unrelated, but I want to add some comments…

Any carrier board which has a different layout (e.g., some pins of the module have multiple possible functions) than the dev kit will need a different device tree to set up that pin layout (that device tree will differ from the default tree based on the pins which have different function). It is quite easy for a minor issue in the device tree to disable some hardware (perhaps hardware used in boot).

If security fuses are not burned, then device tree and kernel content of an eMMC model can be taken from the signed partitions. On the other hand, if device tree and/or kernel are named in extlinux.conf, then the files named take precedence. Make sure you know which tree is being used, and that the tree is the one you expect for the required wiring layout of that board.
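
To make the "which tree is being used" check concrete, here is a small Python sketch that reads an extlinux.conf and reports, per boot entry, whether a kernel (`LINUX`) and device tree (`FDT`) are named there. The function names are mine, the default path is the usual `/boot/extlinux/extlinux.conf` location, and the parser is deliberately minimal (it only looks at `LABEL`, `LINUX`, `INITRD`, `FDT`, and `APPEND` lines), so treat it as a starting point rather than a full parser.

```python
import sys

KEYS = ("LINUX", "INITRD", "FDT", "APPEND")

def parse_extlinux(text):
    """Return {label: {key: value}} for the LINUX/INITRD/FDT/APPEND lines."""
    entries, current = {}, None
    for raw in text.splitlines():
        parts = raw.strip().split(None, 1)
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        key = parts[0].upper()
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "LABEL":
            current = entries.setdefault(value, {})
        elif current is not None and key in KEYS:
            current[key] = value
    return entries

def describe(entry):
    """Say where the kernel and device tree come from for one boot entry."""
    kernel = entry.get("LINUX", "not named (kernel partition)")
    dtb = entry.get("FDT", "not named (device tree partition)")
    return f"kernel: {kernel}; device tree: {dtb}"

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/boot/extlinux/extlinux.conf"
    try:
        with open(path) as f:
            for label, entry in parse_extlinux(f.read()).items():
                print(f"{label}: {describe(entry)}")
    except OSError as exc:
        print(f"cannot read {path}: {exc}")
```

Running this on each NX and comparing the output between the working and non-working setups would at least confirm that the same kernel and tree are being selected.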

The initrd can complicate this. This is basically a very tiny Linux operating system using the outside kernel (from “/boot” or the signed partition), but it has a minimal init (systemd for most Ubuntu), and it also has needed kernel modules. Those modules in turn must match the kernel which is being used (some modules won’t load if they are compiled against the same kernel, but a different configuration). Make sure you know which kernel is being used, and which modules are required for the moment the pivot root transfers from the initrd ramdisk to the main storage (e.g., NVMe, USB drive, so on).
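
A quick first check for the kernel/module match is whether a module tree exists for the exact release string of the running kernel. The Python sketch below assumes the standard `/lib/modules/<release>` layout; the function names are my own. A matching directory name is necessary but not sufficient, since modules built against a different configuration of the same release can still fail to load.

```python
import os
import platform

def installed_module_trees(root="/lib/modules"):
    """List the kernel releases that have a module directory under root."""
    try:
        return sorted(os.listdir(root))
    except OSError:
        return []

def modules_match(release, installed):
    """True if a module tree exists for exactly this kernel release string."""
    return release in installed

if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    trees = installed_module_trees()
    print("running kernel release:", release)
    print("installed module trees:", ", ".join(trees) or "(none found)")
    if modules_match(release, trees):
        print("a module tree exists for the running kernel")
    else:
        print("no module tree matches the running kernel release")
```

The same comparison can be made by hand against the module directory inside an unpacked initrd.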

Often boot logs will provide that information if logging is enabled. Note that logging might not be enabled within an initrd, but if not, then you can often look at messages before and after the initrd and see what is going on. There may also be device tree changes within the initrd.
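
Since errors show up in both the working and non-working logs, one way to narrow things down is to list only the messages that appear in the failing log and never in the working one. The Python sketch below does that after stripping kernel timestamps and process IDs so that otherwise identical lines compare equal. The function names and normalization rules are my own assumptions and may need adjusting to the actual log format.

```python
import re
import sys

_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")  # e.g. "[   12.345678] "
_PID = re.compile(r"\[\d+\]")                    # e.g. "[2405]"

def normalize(line):
    """Drop the leading timestamp and mask PIDs so lines can be compared."""
    line = _TIMESTAMP.sub("", line.strip())
    return _PID.sub("[PID]", line)

def only_in_failing(failing_text, working_text):
    """Unique normalized lines of the failing log that the working log lacks."""
    working = {normalize(l) for l in working_text.splitlines()}
    seen, result = set(), []
    for l in failing_text.splitlines():
        n = normalize(l)
        if n and n not in working and n not in seen:
            seen.add(n)
            result.append(n)
    return result

if __name__ == "__main__":
    if len(sys.argv) == 3:
        try:
            with open(sys.argv[1], errors="replace") as f:
                failing = f.read()
            with open(sys.argv[2], errors="replace") as f:
                working = f.read()
            for line in only_in_failing(failing, working):
                print(line)
        except OSError as exc:
            print(f"cannot read logs: {exc}")
    else:
        print("usage: compare_logs.py FAILING_LOG WORKING_LOG")
```

For this thread that would be something like `python3 compare_logs.py stand_alone.txt working_jig.txt` (the script name is arbitrary). Lines containing addresses or counters will still differ and need a manual look.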

Is this log coming from one of the Xavier NX modules on the “Vendor’s Carrier board”?
How about the other Xavier NX modules? Do all of them have similar serial console logs?

Also, this seems to be a warning informing you that the kernel log buffer is full, rather than an error message.

This is coming from one of the NXs. I am waiting to get access to the system again so that I can check the other ones, put the hardware in the working jig again, and look for more logs. I will update and ask more when I have more data…

You are probably right
